
What are the differences between checksums (e.g. Fletcher, Adler, CRC), non-cryptographic hashes (e.g. xxHash, MurmurHash, CityHash) and cryptographic hashes (e.g. MD5, SHA1, SHA3)?

I am familiar with checksums and how they're used to detect errors in data, and how the design can influence collisions (like [0,0] [0,0,0] would give the same checksum in many simple algorithms, even Fletcher).
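As a minimal sketch of that zero-padding collision: `fletcher16` below is a textbook-style Fletcher-16 written only for illustration (the function name and the modulo-255 variant are assumptions, not taken from any particular library), while `zlib.adler32` comes from the Python standard library.

```python
import zlib

def fletcher16(data: bytes) -> int:
    """Illustrative Fletcher-16: two running sums, each reduced modulo 255."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1

two_zeros = bytes([0, 0])
three_zeros = bytes([0, 0, 0])

# Both sums start at 0 and zero bytes never move them, so the length is invisible.
print(fletcher16(two_zeros) == fletcher16(three_zeros))      # True

# Adler-32 seeds its first sum with 1, so every extra zero byte still
# changes the second sum and the two inputs get different checksums.
print(zlib.adler32(two_zeros) == zlib.adler32(three_zeros))  # False
```

Adler-32 escaping this particular case does not make it collision-free in general; it only shows how a small design choice (the initial value) changes which inputs collide.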

I also have a basic understanding of how cryptographic hash functions work. They're more taxing to compute, produce a more randomized output, and generally have a larger output size than a checksum. And somehow they're designed to be secure.

However, non-cryptographic functions are a mystery to me. None are really taken seriously or standardized. We have CRC32 being implemented in hardware, huge competitions for the new SHA3 standard, etc. But non-cryptographic functions appear to be just curiosities made independently. And Google's CityHash is apparently made specifically for strings.

If hash tables are solely the purpose of non-cryptographic hashes (and strings, as CityHash advertises), then are they not appropriate for error-detection in large binary data files that SHA1 and CRC32 are often employed for?

I have noticed striking similarities in the source code of simple algorithms such as DJB2 and Adler32, for example. They all appear to accomplish the same thing.
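The resemblance is easy to see side by side. The sketch below assumes the commonly published forms of both algorithms; masking DJB2 to 32 bits is an assumption made here so the two outputs are comparable, and the hand-written `adler32` is checked against `zlib.adler32` from the standard library.

```python
import zlib

def djb2(data: bytes) -> int:
    """DJB2 as commonly published: h = h * 33 + byte, kept to 32 bits here."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h

def adler32(data: bytes) -> int:
    """Adler-32: two running sums modulo 65521, packed into one 32-bit value."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % 65521
        b = (b + a) % 65521
    return (b << 16) | a

sample = b"hello world"
print(hex(djb2(sample)))
print(adler32(sample) == zlib.adler32(sample))  # True
```

Both are a single loop that folds each byte into a small running state. The difference is in the mixing step: DJB2 multiplies the state before adding, Adler-32 only adds.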

I've used MD5 in the past to find duplicate files, errors in file downloads, and to identify minor differences in otherwise identical files. MD5 is also reputedly not that fast, and CRC32 achieves the same thing, although with potentially more collisions (which bothers me a bit).
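A minimal sketch of that kind of fingerprinting, using only `hashlib.md5` and `zlib.crc32` from the Python standard library. The helper name `fingerprints` and the chunk size are assumptions for illustration; in-memory streams stand in for real files.

```python
import hashlib
import io
import zlib

def fingerprints(stream, chunk_size=1 << 16):
    """Read a binary stream in chunks; return (crc32, md5_hex) of its contents."""
    crc = 0
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc)  # continue the running CRC across chunks
        md5.update(chunk)
    return crc, md5.hexdigest()

original = fingerprints(io.BytesIO(b"example payload" * 1000))
copy = fingerprints(io.BytesIO(b"example payload" * 1000))

print(original == copy)  # True: identical contents give identical fingerprints
```

With a real file, the stream would come from `open(path, "rb")`. The 32-bit CRC leaves far fewer possible values than the 128-bit MD5 digest, which is where the higher collision risk comes from.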

As for my ideal use case, I would prefer an algorithm like MD5 (or shorter output length) that is as fast as possible (no security considerations required), with as few collisions as possible. It would be used as a fast checksum/file fingerprint. Recently, I have found the 128-bit version of MurmurHash3 (specifically the x86 version for JS implementation) to fit these requirements. But again, MurmurHash is not a checksum--it's for hash tables, so I don't know if this would be a misuse.

Can anyone explain the differences, or refer me to any books/articles on the subject?

bryc

2 Answers


Cryptographic functions are designed to survive some adversarial setting; their designs assume that there will be very clever people trying as hard as they can to "fool" them. Non-cryptographic functions on the other hand are "optimistic"—they're designed to perform well with the sort of data that their designers expect will be given to them.

In fact, non-cryptographic functions are often observed to synergize with data sets that exhibit the expected patterns—for example, note how the functions tested in this Programmer's Stack Exchange answer have lower-than-chance collisions for the consecutive numbers data set. This behavior is often by design.
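A toy illustration of that effect, as a sketch under stated assumptions: `trivial_hash` is a deliberately simple identity hash invented for this example (it is not one of the functions from the linked answer), and truncated SHA-256 stands in for a function whose output looks random.

```python
import hashlib

def collisions(hash_fn, keys, buckets):
    """Count keys that land in a bucket already taken by an earlier key."""
    seen = set()
    count = 0
    for key in keys:
        slot = hash_fn(key) % buckets
        if slot in seen:
            count += 1
        else:
            seen.add(slot)
    return count

def trivial_hash(n: int) -> int:
    """Hypothetical 'optimistic' hash: the identity function."""
    return n

def sha256_hash(n: int) -> int:
    """Stand-in for a random-looking hash: first 8 bytes of SHA-256."""
    digest = hashlib.sha256(str(n).encode()).digest()
    return int.from_bytes(digest[:8], "big")

keys = range(1024)  # consecutive numbers, as in the linked data set
print(collisions(trivial_hash, keys, 1024))  # 0
print(collisions(sha256_hash, keys, 1024))   # many: random placement collides often
```

Consecutive keys fill the table perfectly under the trivial function, which is better than chance. Feed the same function keys that are all multiples of 1024 and every one of them lands in the same bucket, which is the price of that optimism.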

"But non-cryptographic functions appear to be just curiosities made independently. And Google's CityHash is apparently made specifically for strings. If hash tables are solely the purpose of non-cryptographic hashes (and strings, as CityHash advertises), then are they not appropriate for error-detection in large binary data files that SHA1 and CRC32 are often employed for?"

The answer to this is that there's no general answer other than whatever the designers of the functions set themselves as their goals. With cryptographic hash functions, the adversarial setting dictates the standard that functions must meet, but once you take that away authors can just do whatever they think is most appropriate.

It's worth noting, however, that many of the modern non-crypto hash functions are tested using the SMHasher test suite, so lately you tend to see lots of hash functions that score well at it. Here's a brief description of its tests.
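One property such suites examine is avalanche behaviour: flipping a single input bit should flip about half of the output bits. The sketch below is a simplified illustration of that idea, not SMHasher's actual procedure; the helper names and the sample input are assumptions.

```python
import hashlib
import zlib

def avalanche(hash_fn, out_bits, data: bytes) -> float:
    """Average fraction of output bits that flip when one input bit is flipped."""
    base = hash_fn(data)
    n_flips = len(data) * 8
    total = 0
    for i in range(n_flips):
        mutated = bytearray(data)
        mutated[i // 8] ^= 1 << (i % 8)  # flip exactly one input bit
        total += bin(base ^ hash_fn(bytes(mutated))).count("1")
    return total / (n_flips * out_bits)

def md5_int(data: bytes) -> int:
    """MD5 digest as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

sample = b"8 bytes!"
print(avalanche(md5_int, 128, sample))      # close to 0.5
print(avalanche(zlib.adler32, 32, sample))  # well below 0.5 on short inputs
```

A score near 0.5 means the output gives little away about which input bit changed. Simple additive checksums fall short of that on short inputs because a one-bit change only nudges their running sums.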

Luis Casillas

My understanding of a checksum is that it's optimised for detecting changes to the input, whether intentional or accidental. That is, it's important that you can't arbitrarily change the input and leave the checksum the same, but it matters less if coincidental collisions occur between completely different inputs.

An example: Suppose you have some software code which outputs "Hello world", and has a checksum value of XYZ0011. You might be able to find another string which produces the same checksum value, but it won't be valid code. You certainly can't make the code output "Goodbye world" without completely changing the checksum value.
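A small sketch of that example, using `zlib.crc32` from the Python standard library as the checksum. The two one-line programs are hypothetical stand-ins for the "software code" described above.

```python
import zlib

hello = b'print("Hello world")'
goodbye = b'print("Goodbye world")'

# Compare the checksums of the two programs.
print(zlib.crc32(hello) == zlib.crc32(goodbye))

def single_bit_flips(data: bytes):
    """Yield every copy of data that differs from it in exactly one bit."""
    for i in range(len(data) * 8):
        mutated = bytearray(data)
        mutated[i // 8] ^= 1 << (i % 8)
        yield bytes(mutated)

# CRC-32 is guaranteed to detect any single-bit error.
undetected = sum(zlib.crc32(m) == zlib.crc32(hello) for m in single_bit_flips(hello))
print(undetected)  # 0
```

This only exercises accidental-style changes such as a flipped bit or an edited string; it says nothing about an input constructed on purpose to match a given checksum.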

Beejamin