
>which is probably really unique

Wonder what the distribution is here, on average? I know certain file types tend to cluster in specific ranges.

>maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash

Definitely, for comparing any two files. But if you're searching for duplicates across the entire disk, each file can end up being compared against many others. So hashing each file once on the first pass could conceivably be more efficient than repeated byte-by-byte comparisons.

>if you just compare the bytes there is no chance of hash collision

You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.

But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.

You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.
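The two-pass strategy described above (group by size first, then hash only within groups that actually have potential matches) might look something like this minimal sketch; `find_duplicates` and its structure are illustrative assumptions, not any particular tool's implementation:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    # Pass 1: group files by size. Files with a unique size
    # cannot have a duplicate, so they are filtered out cheaply.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable file; skip it

    # Pass 2: hash only within size groups of two or more files.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, can't be a duplicate
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_digest[digest].append(path)

    return {d: ps for d, ps in by_digest.items() if len(ps) > 1}
```

For very large files you'd want to hash in chunks rather than read the whole file into memory, and (per the point above) for size groups of exactly two files a direct byte-by-byte comparison would avoid hashing entirely.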



> exceedingly rare

To have a mere one-in-a-billion chance of getting a SHA-256 collision, you'd need to spend 160 million times more energy than the total annual energy production on our planet (and that's assuming our best Bitcoin mining efficiency; actual file hashing needs far more energy per hash).

The probability of a collision is so astronomically small that if your computer ever observed a SHA-256 collision, it would almost certainly be due to a CPU or RAM failure (bit flips are within the range of probabilities that actually happen).
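The scale here follows from the birthday bound: among n random k-bit hashes, the collision probability is roughly n² / 2^(k+1). Solving for n at a target probability p gives a sense of how many files you'd have to hash (the function name is just for illustration):

```python
import math

def hashes_for_collision_probability(p, bits=256):
    # Birthday approximation: p ≈ n^2 / 2^(bits+1),
    # so n ≈ sqrt(2 * p * 2^bits).
    return math.sqrt(2 * p * 2**bits)

# Number of SHA-256 hashes needed for a one-in-a-billion
# collision chance: on the order of 10^34 files.
n = hashes_for_collision_probability(1e-9)
```

Roughly 1.5 × 10³⁴ hashes — vastly more files than any disk, or all disks ever made, will hold.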


You know, I've been hearing people warn of handling potential collisions for years and knew the odds were negligible, but never really delved into it in any practical sense.

Context is everything.



