I'm building a system that has to take file paths, and generate a unique name for each one. I'm planning on using SHA1 as the hash function. My question is: do I have to deal with possible collisions (2 different paths producing the same SHA1 value) or can I assume it won't occur?
4 Answers
The chance of a collision among $n$ hashed paths is approximately $ \frac{1/2 \cdot n^2}{2^{160}} $, which for n=100k evaluates to about $ 3.4 \cdot 10^{-39} $. So it is fair to say that such a collision won't occur accidentally.
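To check the arithmetic, here is a minimal sketch that evaluates the same approximation. The function name `collision_probability` is made up for illustration; exact fractions are used so the huge denominator doesn't lose precision before the final conversion.

```python
from fractions import Fraction

def collision_probability(n: int, bits: int = 160) -> float:
    """Birthday approximation (1/2 * n^2) / 2^bits for n hashes of `bits` bits."""
    return float(Fraction(n * n, 2 * 2**bits))

p = collision_probability(100_000)
print(f"{p:.1e}")  # about 3.4e-39, matching the figure above
```

Passing `bits=32` instead shows how quickly the estimate grows for a short hash.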
The only concrete SHA-1 collision to date is the one Google published on February 23rd, 2017. Accidental collisions only become likely once you generate about $2^{80}$, or roughly $10^{24}$, hashes.
If cryptanalysis advances, an attacker might be able to create inputs that deliberately collide. Currently, however, there is no known way to do this efficiently. Of course, this only matters if your application needs to protect against deliberate collisions; many applications only require protection against accidental collisions. If you need protection against deliberate collisions, I'd prefer SHA-2 over SHA-1.
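As a rough sketch of what the naming step could look like, assuming Python's standard `hashlib` (the helper name `unique_name` is hypothetical, not something from the question), switching between SHA-1 and SHA-2 is a one-argument change:

```python
import hashlib

def unique_name(path: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a path, for use as a file name."""
    # Encode explicitly so the same path always maps to the same bytes.
    return hashlib.new(algorithm, path.encode("utf-8")).hexdigest()

print(unique_name("/home/user/report.txt"))          # 64 hex characters (SHA-256)
print(unique_name("/home/user/report.txt", "sha1"))  # 40 hex characters (SHA-1)
```

The example path is only a placeholder. Note that two spellings of the same file (for example with and without a trailing slash) hash differently, so paths would need to be normalized first if that matters.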
Answer through experiment and observation.
I hashed:
- all 216,553 words in the English language
- uppercase form of all 216,553 words in the English language
- 216,553 type 1 ("sequential") uuids
- 216,553 type 4 ("random") uuids
- all numbers from "1" to "216553"
In all 1,082,765 of those hashes, there were zero collisions.
This contrasts with some of the common non-cryptographic hash functions, which experience a dozen or so collisions (with a 32-bit hash, as opposed to SHA1's 160-bit hash), e.g. with the Murmur2 hash:
- cataract collides with periti
- roquette collides with skivie
- shawl collides with stormbound
- dowlases collides with tramontane
- cricketings collides with twanger
- longans collides with whigs
A suggestion would be to construct a few billion random path strings and see if you get any collisions. I can say, though, that as long as you're dealing with items less than $2^{64}$ bits in length, with SHA-1:
it is computationally infeasible to find two different messages which produce the same message digest
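A scaled-down version of this kind of experiment might look like the sketch below. It reuses just one of the input sets (the numbers "1" to "216553"); `count_collisions` is a made-up helper name. Truncating the SHA-1 digest to 4 bytes is only a stand-in for "some 32-bit hash" and is not Murmur2, so the collision words listed earlier will not be reproduced.

```python
import hashlib

def count_collisions(inputs, digest_bytes=20):
    """Count inputs whose (possibly truncated) SHA-1 digest was already
    produced by a different, earlier input."""
    seen = {}
    collisions = 0
    for text in inputs:
        key = hashlib.sha1(text.encode("utf-8")).digest()[:digest_bytes]
        if key in seen and seen[key] != text:
            collisions += 1
        else:
            seen[key] = text
    return collisions

inputs = [str(i) for i in range(1, 216_554)]
print("full 160-bit SHA-1:", count_collisions(inputs))
print("truncated to 32 bits:", count_collisions(inputs, digest_bytes=4))
```

With the full digest the count should be zero. The truncated count is typically a handful, in line with the birthday estimate for a 32-bit output.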
If you want an absolute guarantee of no collisions, then use a cipher, not a hash. Encrypt the numbers 0, 1, 2, ... 99998, 99999, 100000 and the outputs are guaranteed to be unique for a given key, because encryption with a fixed key is a one-to-one mapping. Convert to hex or Base64 for incorporation into a filename. The Hasty Pudding cipher can be set for any desired range of numbers, or use DES for 64-bit numbers. You could even roll your own simple Feistel cipher with an 18-bit block size (enough for values up to 262,143) if security is less important than uniqueness.
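To make the "roll your own" option concrete, here is a toy balanced Feistel network with an 18-bit block (two 9-bit halves). Everything in it is an illustrative assumption: the round count, the SHA-256-based round function and the key `b"demo-key"`. It aims at uniqueness only and should not be treated as secure.

```python
import hashlib

HALF_BITS = 9                       # 18-bit block = two 9-bit halves
HALF_MASK = (1 << HALF_BITS) - 1

def _round(half: int, round_no: int, key: bytes) -> int:
    # Toy round function. A Feistel network is a permutation no matter
    # what this returns, as long as it is deterministic.
    data = key + bytes([round_no]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big") & HALF_MASK

def feistel_encrypt(value: int, key: bytes, rounds: int = 4) -> int:
    if not 0 <= value < (1 << (2 * HALF_BITS)):
        raise ValueError("value does not fit in the 18-bit block")
    left, right = value >> HALF_BITS, value & HALF_MASK
    for r in range(rounds):
        left, right = right, left ^ _round(right, r, key)
    return (left << HALF_BITS) | right

# Encrypt 0..100000 and render each result as 5 hex digits.
names = {format(feistel_encrypt(i, b"demo-key"), "05x") for i in range(100_001)}
print(len(names))  # every input gets its own name
```

Each round maps (left, right) to (right, left XOR f(right)), which can always be undone, so distinct inputs can never share an output under the same key.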
If you are not multi-threading, you could create unique file names by using the current timestamp in nanoseconds. Or you could take a millisecond-resolution timestamp and concatenate it with some quick hash. With this method, a cryptographic hash like SHA-1 seems like overkill, and you might get away with a simple CRC32 checksum. If you want to be on the safe side, you can use MD5, which is a cryptographic hash but faster than SHA1.
These methods assume that the system clock is never turned back, by accident or on purpose, or that the name generation process is only run once.
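A minimal sketch of the millisecond-timestamp-plus-checksum variant, assuming Python's standard `time` and `zlib` modules; the helper name `timestamp_name` and the `-` separator are arbitrary choices for illustration:

```python
import time
import zlib

def timestamp_name(path: str) -> str:
    """Combine a millisecond timestamp with the CRC32 of the path."""
    millis = time.time_ns() // 1_000_000
    checksum = zlib.crc32(path.encode("utf-8"))
    return f"{millis}-{checksum:08x}"

print(timestamp_name("/home/user/report.txt"))
```

The caveats above still apply: two different paths with the same CRC32 that are named within the same millisecond would get the same name, and so would anything generated after the clock is set back.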