Is it fair to assume that SHA1 collisions won't occur on a set of <100k strings?

Question

I'm building a system that has to take file paths, and generate a unique name for each one. I'm planning on using SHA1 as the hash function. My question is: do I have to deal with possible collisions (2 different paths producing the same SHA1 value) or can I assume it won't occur?
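
For concreteness, the scheme being asked about might look like the sketch below. The function name `name_for_path` and the choice of UTF-8 encoding are assumptions made for illustration; they are not part of the original question.

```python
import hashlib

def name_for_path(path: str) -> str:
    """Return the SHA-1 digest of a path as a 40-character hex string."""
    # Assumption: paths are encoded as UTF-8 before hashing.
    return hashlib.sha1(path.encode("utf-8")).hexdigest()

# Hypothetical example path.
print(name_for_path("/var/data/report.txt"))
```

The same path always maps to the same name, so the only way two different paths can share a name is a SHA-1 collision.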

score 31 · Accepted Answer · edited Feb 24 '17 at 06:04

The chance of a collision in such a set is approximately $\frac{1/2 \cdot n^2}{2^{160}}$, which for n = 100k evaluates to about $3.4 \cdot 10^{-39}$. So it is fair to say that such a collision won't occur accidentally.
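
The arithmetic can be checked with a few lines of Python. This is only a sketch of the approximation quoted above (it is valid while the resulting probability is small); the function name `collision_probability` is made up for this example.

```python
from fractions import Fraction

def collision_probability(n: int, bits: int = 160) -> float:
    """Birthday-bound approximation: (n^2 / 2) / 2^bits."""
    # Exact rational arithmetic first, converted to float only at the end.
    return float(Fraction(n * n, 2) / 2**bits)

print(collision_probability(100_000))  # on the order of 1e-39
print(float(2**80))                    # number of hashes at which collisions become likely
```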

~~AFAIK nobody has ever found a SHA-1 collision~~ The only concrete SHA-1 collision to date was Google's on February 23rd, 2017 (found here). Accidental collisions only become likely once you generate about $2^{80}$ (roughly $10^{24}$) hashes.

If cryptanalysis advances further, an attacker might be able to create inputs that deliberately collide at low cost. Currently, however, there is no known way to do this cheaply; the 2017 collision required a very large computational effort. Of course, this only matters if your application needs to protect against deliberate collisions; many applications only require protection against accidental collisions. If you need protection against deliberate collisions, I'd prefer SHA-2 over SHA-1.

score 4 · Answer 2 · edited Aug 22 '21 at 15:22

Answer through experiment and observation.

I hashed:

all 216,553 words in the English language
uppercase form of all 216,553 words in the English language
216,553 type 1 ("sequential") uuids
216,553 type 4 ("random") uuids
all numbers from "1" to "216553"

In all 1,082,765 of those hashes, there were zero collisions.
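
A scaled-down version of this experiment could be sketched as follows. Only the numeric-string input set is reproduced here, since the word list and UUIDs used above are not available; the helper name `find_sha1_collisions` is an assumption of this sketch.

```python
import hashlib

def find_sha1_collisions(strings):
    """Return (first, second) pairs of distinct strings sharing a SHA-1 digest."""
    seen = {}          # digest -> string that produced it
    collisions = []
    for s in strings:
        digest = hashlib.sha1(s.encode("utf-8")).digest()
        other = seen.get(digest)
        if other is not None and other != s:
            collisions.append((other, s))
        else:
            seen[digest] = s
    return collisions

# All numbers from "1" to "216553", as in the list above.
numbers = [str(i) for i in range(1, 216_554)]
print(len(find_sha1_collisions(numbers)))
```

Repeated inputs are not reported as collisions, because only distinct strings with an identical digest count.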

This contrasts with some of the common non-cryptographic hash functions, which produce a dozen or so collisions on the same data (they are 32-bit hashes, as opposed to SHA-1's 160 bits). For example, with Murmur2:

cataract collides with periti
roquette collides with skivie
shawl collides with stormbound
dowlases collides with tramontane
cricketings collides with twanger
longans collides with whigs
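
The birthday estimate explains why a 32-bit hash shows a handful of collisions per input set while a 160-bit hash shows none. The sketch below computes the expected number of colliding pairs for 216,553 inputs, then counts collisions empirically. Murmur2 is not in the Python standard library, so SHA-1 truncated to 32 bits is used as a stand-in here; that substitution and the function names are assumptions of this example, not what the answer measured.

```python
import hashlib

def expected_collisions(n: int, bits: int) -> float:
    """Expected number of colliding pairs among n random hashes of the given width."""
    return n * (n - 1) / 2 / 2**bits

def count_truncated_collisions(strings, bits: int = 32) -> int:
    """Count inputs lost to collisions when SHA-1 is cut down to `bits` bits."""
    unique = set(strings)  # ignore repeated inputs
    truncated = {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest(), "big") >> (160 - bits)
        for s in unique
    }
    return len(unique) - len(truncated)

n = 216_553
print(expected_collisions(n, 32))    # a handful of expected collisions
print(expected_collisions(n, 160))   # vanishingly small
print(count_truncated_collisions(str(i) for i in range(1, n + 1)))
```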

A suggestion would be to construct a few billion random path strings and see if you get any collisions. That said, I can say (as long as you're dealing with items less than 2⁶⁴ bytes in length) that with SHA-1:

it is computationally infeasible to find two different messages which produce the same message digest

score 1 · Answer 3 · answered Apr 30 '15 at 22:43

If you want an absolute guarantee of no collisions, then use a cipher, not a hash. Encrypt the numbers 0, 1, 2, ... 99998, 99999, 100000 and the outputs are guaranteed to be unique for a given key, because a cipher is a reversible mapping. Convert to hex or Base64 for incorporation into a filename. The Hasty Pudding cipher can be set for any desired range of numbers, or use DES for 64-bit numbers. You could even roll your own simple Feistel cipher with an 18-bit block size if security is less important than uniqueness.
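
A toy version of the "roll your own Feistel cipher" idea might look like this. It is a sketch for uniqueness only, not a secure cipher: the round function (SHA-256 of key, round number, and half-block, truncated to 9 bits), the round count, and the example key are all arbitrary choices made for this illustration.

```python
import hashlib

HALF_BITS = 9                       # two 9-bit halves give an 18-bit block
HALF_MASK = (1 << HALF_BITS) - 1

def _round(half: int, key: bytes, rnd: int) -> int:
    """Keyed round function; it does not need to be invertible."""
    data = key + bytes([rnd]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big") & HALF_MASK

def encrypt(x: int, key: bytes, rounds: int = 4) -> int:
    """Permute an 18-bit integer with a balanced Feistel network."""
    if not 0 <= x < (1 << (2 * HALF_BITS)):
        raise ValueError("input must fit in 18 bits")
    left, right = x >> HALF_BITS, x & HALF_MASK
    for r in range(rounds):
        left, right = right, left ^ _round(right, key, r)
    return (left << HALF_BITS) | right

def decrypt(y: int, key: bytes, rounds: int = 4) -> int:
    """Invert encrypt() by running the rounds backwards."""
    left, right = y >> HALF_BITS, y & HALF_MASK
    for r in reversed(range(rounds)):
        left, right = right ^ _round(left, key, r), left
    return (left << HALF_BITS) | right

key = b"example key"                # hypothetical key
print(f"{encrypt(0, key):05x}", f"{encrypt(1, key):05x}")
```

Because every input can be decrypted back, no two inputs can share an output, which is the guarantee a hash cannot give. Note that this names items by their index (0 to 100000), so each path still has to be assigned a number first.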

score -2 · Answer 4 · answered May 11 '12 at 14:45

If you are not multi-threading, you could create unique file names by taking the current timestamp in nanoseconds and using that. Or you could use a millisecond-resolution timestamp and concatenate that with some quick hash. A cryptographic hash like SHA-1 seems like overkill with this method, and you might get away with a simple CRC32 checksum, but if you want to be on the safe side you can use MD5, which is a cryptographic hash but faster than SHA-1.

These methods assume that the system clock is never turned back, by accident or on purpose, and that the name generation process is only run once.
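
A minimal sketch of the millisecond-timestamp-plus-checksum variant, assuming CRC32 as the "quick hash"; the function name `timestamp_name` and the `now_ms` parameter (added so the clock can be fixed for testing) are hypothetical.

```python
import time
import zlib

def timestamp_name(path: str, now_ms=None) -> str:
    """Build a name from a millisecond timestamp plus a CRC32 of the path."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000   # current time in milliseconds
    checksum = zlib.crc32(path.encode("utf-8"))
    return f"{now_ms}-{checksum:08x}"

# Hypothetical example path.
print(timestamp_name("/var/data/report.txt"))
```

Unlike a pure hash of the path, the result depends on when it was generated, so the same path gets a different name on each run.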
