12

I'm building a system that has to take file paths, and generate a unique name for each one. I'm planning on using SHA1 as the hash function. My question is: do I have to deal with possible collisions (2 different paths producing the same SHA1 value) or can I assume it won't occur?

Denis Hennessy

4 Answers

31

The chance of a collision in such a set is approximately $ \frac{1/2 \cdot n^2}{2^{160}} $, which for n=100k evaluates to about $ 3.4 \cdot 10^{-39} $. So it is fair to say that such a collision won't occur accidentally.
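To make the arithmetic easy to check, here is a minimal sketch that evaluates the approximation above. The function name `collision_probability` is just an illustrative choice, and the formula is the birthday-bound estimate, not an exact probability.

```python
from fractions import Fraction

def collision_probability(n: int, bits: int = 160) -> float:
    """Birthday-bound approximation: (n^2 / 2) / 2^bits."""
    # Use exact rational arithmetic first, then convert to float once.
    return float(Fraction(n * n, 2 * 2**bits))

p = collision_probability(100_000)
print(f"{p:.1e}")  # on the order of 3.4e-39 for n = 100k
```

Plugging in a larger `n` shows how slowly the estimate grows: it only approaches 1/2 when `n` is around $2^{80}$.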

The only concrete SHA-1 collision to date is the one Google announced on February 23rd, 2017, and it was constructed deliberately rather than found by accident. Accidental collisions become likely only once you generate about $2^{80}$, or roughly $10^{24}$, hashes.

If cryptanalysis advances, an attacker might be able to create inputs that deliberately collide. Currently, however, there is no known way to do this efficiently. Of course, this only applies if your application needs to protect against deliberate collisions; many applications only require protection against accidental collisions. If you need protection against deliberate collisions, I'd prefer SHA-2 over SHA-1.

657784512
CodesInChaos
4

Answer through experiment and observation.

I hashed:

In all 1,082,765 of those hashes, there were zero collisions.

This contrasts with some of the common non-cryptographic hash functions, which experience a dozen or so collisions (with a 32-bit hash, as opposed to SHA1's 160-bit hash), e.g. Murmur2:

  • cataract collides with periti
  • roquette collides with skivie
  • shawl collides with stormbound
  • dowlases collides with tramontane
  • cricketings collides with twanger
  • longans collides with whigs

A suggestion would be to construct a few billion random path strings and see if you get any collisions. That said, I can say (as long as you're dealing with items less than $2^{64}$ bits in length) that with SHA-1:

it is computationally infeasible to find two different messages which produce the same message digest
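A scaled-down version of that experiment could look like the sketch below. It assumes Python's standard `hashlib`; the helper names `random_path` and `count_sha1_collisions`, the path shape, and the sample size of 100,000 are made up for illustration and are far smaller than "a few billion".

```python
import hashlib
import random
import string

def random_path(rng: random.Random, depth: int = 4, part_len: int = 8) -> str:
    """Build a made-up path like /abcdefgh/ijklmnop/... from random letters."""
    parts = ("".join(rng.choices(string.ascii_lowercase, k=part_len))
             for _ in range(depth))
    return "/" + "/".join(parts)

def count_sha1_collisions(paths) -> int:
    """Count pairs of *different* paths that share a SHA-1 digest."""
    seen = {}  # digest -> first path that produced it
    collisions = 0
    for path in paths:
        digest = hashlib.sha1(path.encode("utf-8")).digest()
        if digest in seen and seen[digest] != path:
            collisions += 1
        seen.setdefault(digest, path)
    return collisions

rng = random.Random(12345)  # fixed seed so the run is repeatable
paths = {random_path(rng) for _ in range(100_000)}
print(count_sha1_collisions(paths))
```

A run like this only shows that accidental collisions are not observed at this scale; it says nothing about deliberately constructed inputs.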

Ian Boyd
1

If you want an absolute guarantee of no collisions, then use a cipher, not a hash. Encrypt the numbers 0, 1, 2, ... 99998, 99999, 100000 and the outputs are guaranteed to be unique for a given key. Convert to hex or Base64 for incorporation into a filename. The Hasty Pudding cipher can be set for any desired range of numbers, or use DES for 64-bit numbers. You could even roll your own simple Feistel cipher with an 18-bit block size if security is less important than uniqueness.
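As a rough illustration of the roll-your-own option, here is a toy balanced Feistel network with an 18-bit block (two 9-bit halves). Everything specific in it is an assumption for the sketch: the key, the four rounds, and the SHA-1-based round function. It is meant to show uniqueness, not to be secure.

```python
import hashlib

HALF_BITS = 9                      # two 9-bit halves give an 18-bit block
HALF_MASK = (1 << HALF_BITS) - 1
KEY = b"example-key"               # hypothetical key, for illustration only

def round_value(key: bytes, round_no: int, half: int) -> int:
    """Toy round function: 9 bits taken from SHA-1(key, round, half)."""
    data = key + bytes([round_no]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha1(data).digest()[:2], "big") & HALF_MASK

def encrypt(key: bytes, value: int, rounds: int = 4) -> int:
    """Map an 18-bit value to another 18-bit value, one-to-one."""
    left, right = value >> HALF_BITS, value & HALF_MASK
    for r in range(rounds):
        left, right = right, left ^ round_value(key, r, right)
    return (left << HALF_BITS) | right

def decrypt(key: bytes, value: int, rounds: int = 4) -> int:
    """Undo encrypt() by running the rounds in reverse order."""
    left, right = value >> HALF_BITS, value & HALF_MASK
    for r in reversed(range(rounds)):
        left, right = right ^ round_value(key, r, left), left
    return (left << HALF_BITS) | right

# Encrypt 0..100000 and render each result as 5 hex digits for a file name.
names = [format(encrypt(KEY, i), "05x") for i in range(100_001)]
print(len(set(names)) == len(names))
```

Because every Feistel round is invertible, the whole mapping is a permutation of the 18-bit values, which is why distinct inputs cannot produce the same name under one key.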

rossum
-2

If you are not multi-threading, you could create unique file names by using the current timestamp in nanoseconds. Or you could use a millisecond-resolution timestamp and concatenate it with some quick hash. A cryptographic hash like SHA-1 seems like overkill if you use this method, and you might get away with a simple CRC32 checksum, but if you want to be on the safe side you can use MD5, which is a cryptographic hash but faster than SHA1.

These methods assume that the system clock is never turned back, by accident or on purpose, or that the name generation process is only run once.
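A minimal sketch of the timestamp-plus-checksum idea, assuming Python 3.7+ for `time.time_ns()`; the function name `timestamp_name` and the `stamp-checksum` layout are illustrative choices. Note that the clock's real resolution can be coarser than one nanosecond on some platforms.

```python
import time
import zlib

def timestamp_name(path: str) -> str:
    """Combine a nanosecond timestamp with a CRC32 of the path."""
    stamp = time.time_ns()                       # wall-clock time in ns
    checksum = zlib.crc32(path.encode("utf-8"))  # quick 32-bit checksum
    return f"{stamp}-{checksum:08x}"

name = timestamp_name("/tmp/example/file.txt")
print(name)
```

Unlike a hash of the path alone, the same path gives a different name on every call, so this only fits if names do not need to be reproducible.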

ZeroOne