
I would like to maintain a list of unique data blocks (up to 1 MiB in size), using the SHA-256 hash of the block as the key in the index. Obviously there is a chance of hash collisions, so what is the best way of reducing that risk? If I also calculate the (e.g.) MD5 hash of the block and use the combination (SHA-256, MD5) as the key, is the chance of a collision about the same as for some 384-bit hash function, or is it a little bit better because I'm using different hash functions?

Thanks for the info!

Edit: My blocks come from normal user data on hard drives, but it will be many petabytes in total.

Edit2: As a follow-up (just tell me if this should be moved to a different question): Since the blocks can vary in size but can be up to some preconfigured limit (e.g. 1MiB), how will collision resistance be affected if I make the (64-bit) size of the block part of the key? That way you can only have collisions of blocks with the same size...
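To make that concrete, here is a minimal Python sketch of the kind of key I have in mind (`block_key` is just an illustrative name), with the 64-bit size prepended to the digest:

    import hashlib
    import struct

    def block_key(block: bytes) -> bytes:
        # Hypothetical key layout: 64-bit big-endian block size, then the
        # SHA-256 digest. Under this key, two blocks can only collide if
        # they have the same size AND the same SHA-256 hash.
        return struct.pack(">Q", len(block)) + hashlib.sha256(block).digest()

    index = {}
    for block in (b"hello", b"world", b"hello"):
        index.setdefault(block_key(block), block)  # third insert is a no-op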

Theodor Kleynhans

4 Answers


The risk of collision is only theoretical; it will not happen in practice. Time spent worrying about such a risk of collision is time wasted. Consider that even if you have $2^{90}$ 1 MiB blocks (that's a billion billion billion blocks -- stored on 1 TB hard disks, the disks would make a pile as large as the USA and several kilometers high), the risk of having a collision is lower than $2^{-76}$. On the other hand, the risk of being mauled by a gorilla escaped from a zoo is at least $2^{-60}$ per day, i.e. 65000 times more probable than a SHA-256 collision over way more blocks than could possibly make sense. Stated otherwise: before hitting a single collision, you can expect visits from 65000 successive murderous gorillas. So if you know what's good for you, drop that MD5 and go buy a shotgun.
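(A quick sanity check of that figure, using the standard birthday bound with $n = 2^{90}$ blocks and a 256-bit hash: $\Pr[\text{collision}] \le \binom{n}{2} \cdot 2^{-256} \approx 2^{179} \cdot 2^{-256} = 2^{-77} < 2^{-76}$.)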

SHA-256 collisions are not scary; gorillas are.

Now for the suggestion of concatenating the outputs of two distinct hash functions, say SHA-256 and MD5. It turns out that this does not enhance security as much as one could believe. The total size of 384 bits would certainly not provide more security against collisions than what a 384-bit hash function would give; in fact it is much weaker than that: it is not really much stronger than SHA-256 alone. See this previous question, and this research article, for the gory details. This can be summed up as follows: when using several hash functions in parallel and concatenating the outputs, the total is no stronger against collisions than the strongest of the individual functions.
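For concreteness, the construction under discussion looks like this (a minimal Python sketch, purely illustrative; per the result cited above, its collision resistance is essentially that of SHA-256 alone, not that of an ideal 384-bit hash):

    import hashlib

    def concat_hash(data: bytes) -> bytes:
        # 384 bits of output: 256 bits of SHA-256 followed by 128 bits of MD5.
        # A collision here requires a simultaneous collision in both functions,
        # but multicollision techniques make that barely harder than colliding
        # the stronger function alone.
        return hashlib.sha256(data).digest() + hashlib.md5(data).digest()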

And, of course, MD5 itself is weak against collisions and as such should not be envisioned for newer designs.

Thomas Pornin

"The risk of collision is only theoretical; it will not happen in practice."

Except in one particular instance. The description given implies that this system is going to be some form of de-duplicating filesystem or backup system. For most users, the collision risk is tiny.

But for one particular class of users there is a much larger risk: cryptographic hash researchers, for whom one could presume that hash collisions within their drives' data are more likely than for the average Joe, simply because they are attempting to manufacture such collisions.

Therefore, if this is to be a de-duplicating filesystem or backup system, and a cryptographic hash researcher makes use of it, the risk of two different data blocks having colliding hashes is larger than it is for the average Joe.

Anon

To have approximately a 50% chance of a collision, you'd need $2^{128}$ data blocks. This comes from the birthday problem. Are you anticipating your list to be that large? I would doubt it, as that would be an astronomical amount of data (much, much more than a petabyte).
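To put that in perspective (a rough estimate, assuming your maximum 1 MiB block size): $2^{128}$ blocks of $2^{20}$ bytes each come to $2^{148}$ bytes, roughly $3.6 \times 10^{44}$ bytes, whereas a petabyte is only $2^{50}$ bytes.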

That said, it is very, very unlikely that a collision for MD5 would also be a collision for SHA-256, so you would probably be fine doing the dual-hash thing, but why not just use SHA-384 (or SHA-512) if you are that worried about collisions?
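If you do want a wider digest, the standard library already has it; a minimal Python sketch:

    import hashlib

    block = b"example data block"
    key = hashlib.sha384(block).digest()     # 48-byte key
    # or: hashlib.sha512(block).digest()     # 64-byte key; SHA-512 is often
    # faster than SHA-256 on 64-bit CPUs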

mikeazo

The risk of collision is practically non-existent, but as a good software developer, write your code to handle it:

If the hashes are equal, compare the block lengths; if those are equal too, compare the blocks byte by byte. If the lengths or the bytes differ, then: 1) increase an integer counter concatenated to the end of the hash ID (it should be 0 everywhere else), 2) LOG THE COLLISION LOUDLY, 3) profit.

The CPU-intensive part is the comparison, but don't worry: it will happen only in the case of duplicates, and even then comparing bytes is lightweight. Test the code by choosing CRC32 as the hash function.
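A minimal Python sketch of that logic (hypothetical names; the hash function is pluggable precisely so the collision path can be exercised with something weak like CRC32):

    import hashlib
    import logging
    import zlib

    def sha256_digest(block: bytes) -> bytes:
        return hashlib.sha256(block).digest()

    def crc32_digest(block: bytes) -> bytes:
        # Deliberately weak 32-bit hash, used only to reach the
        # collision-handling path in tests.
        return zlib.crc32(block).to_bytes(4, "big")

    class DedupIndex:
        def __init__(self, digest=sha256_digest):
            self._digest = digest
            self._store = {}  # (digest, counter) -> block

        def put(self, block: bytes):
            digest = self._digest(block)
            counter = 0  # stays 0 everywhere except after a real collision
            while True:
                key = (digest, counter)
                existing = self._store.get(key)
                if existing is None:
                    self._store[key] = block        # new unique block
                    return key
                # Cheap length check first, then byte-by-byte comparison.
                if len(existing) == len(block) and existing == block:
                    return key                      # true duplicate
                logging.error("HASH COLLISION on digest %s", digest.hex())
                counter += 1                        # try the next slot

    # In tests, inject CRC32 so collisions are actually reachable:
    # idx = DedupIndex(digest=crc32_digest)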

EDIT: Don't underestimate cryptanalysis research; nobody can guarantee that in 5 years it won't be feasible to find a SHA-256 collision, so protect yourself against malicious users as well as against gorillas.

jimis