20

I sometimes run sha256sum on large files after transferring them from one place to another, and will just skim the hash output to verify it's correct. But, I usually just look at the first/last 5 or 6 hex digits and call it good enough.

I know that the chances of collision are something like 1 / 2^64, but what are the chances of a "near collision"? E.g. only one or two hex digits are different.


As a related topic, If you have a binary sequence and just change one bit, you get a completely different hash, correct? So is it possible for random errors to result in nearly identical hashes? I'm aware that MD5 is 'cracked' such that a malicious agent could append whatever necessary data to a file to make it have the same hash output- but is this within any reasonable realm of possibility to have happen stochastically?

EDIT: This topic has spawned some discussion below that was just as (if not more!) informative than the scope of my original question. With that said- I am referring only to using hash functions as a file integrity check, not as protection against attacks.

Paul

5 Answers

24

How many hex digits do I need to compare when manually checking hash functions?

If you actually want the full security guarantees of the hash function to apply: all of them.

I usually just look at the first/last 5 or 6 hex digits and call it good enough.

This effectively reduces the security of the hash function to that of one that only outputs 10–12 hex digits, i.e. 40–48 bits, for which we can easily find collisions in about $2^{20}$–$2^{24}$ evaluations, which is entirely feasible.

what are the chances of a "near collision"? E.g. only one or two hex digits are different.

As discussed in this answer, the probability of at most $t$ bits of difference between two hashes with $n$-bit output length is $$p_t=\sum^t_{k=0}{n\choose k}2^{-n}$$ which means you need about $\sqrt{1/p_t}$ evaluations to come up with a desired near-collision. Strategies for finding these values can be found in this answer (including low-memory approaches).
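A quick sketch of the formula above (the function name is mine, not from the answer), evaluated for SHA-256:

```python
from math import comb

def near_collision_prob(n_bits: int, t: int) -> float:
    """Probability that two independent, uniformly random n-bit hashes
    differ in at most t bit positions: sum_{k=0}^{t} C(n,k) / 2^n."""
    return sum(comb(n_bits, k) for k in range(t + 1)) / 2.0 ** n_bits

# For SHA-256 (n = 256), a difference of at most 8 bits is roughly
# "one or two hex digits off". The probability for a random pair is
# astronomically small, and the work to *find* such a near-collision
# is about sqrt(1/p) hash evaluations.
p = near_collision_prob(256, 8)
work = p ** -0.5
```

For $t=8$ this gives $p\approx 2^{-208}$, so stumbling on a near-collision by accident is not a realistic worry.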

SEJPM
15

If you control both ends as well as the transfer channel - for example, if you are transferring a large file between two of your own computers via a USB drive - then it's OK to only verify the hash superficially. In fact, since you are just checking integrity, you wouldn't even need a cryptographically secure hash function; a CRC would suffice.

If you fear that the file might be tampered with on the way, then you need to check the whole hash. You didn't say which operating system you were on, but it should be easy to automate the comparison after you compute the hashes.
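Automating the full comparison is straightforward. A minimal sketch in Python (function names are mine): it streams the file in chunks so large files don't need to fit in memory, and compares the entire digest rather than a prefix.

```python
import hashlib
import hmac

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def digest_matches(path: str, expected_hex: str) -> bool:
    """Compare the whole digest; compare_digest also avoids timing leaks."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

On Linux, `sha256sum -c checksums.txt` does the same job from the shell.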

It is not feasible for an attacker to create a file whose hash looks almost the same as the original everywhere, but it is quite feasible for a motivated attacker to match the first 8 characters or more. Consider that this is essentially the computation performed by Bitcoin mining, so the special software and hardware developed for it could be reused to create hashes with known prefixes.

is it possible for random errors to result in nearly identical hashes

No. The chances of this are infinitesimal.

Yolanda Ruiz
5

The answer depends on the goal, on how the file was prepared, and on whether the positions of the checked digits are known to adversaries:

  • If you only fear that there was a random error in transfer, checking each hexadecimal digit divides the odds of an undetected error by 16 (a factor of $2^4$); thus checking 5 hex digits leaves a probability of $2^{-5\times4}=2^{-20}$ (less than one chance in a million) of an undetected error, which might be good enough.
    Checking 5 digits at each extremity gives $2^{-40}$ (one chance in a million million; thus manually checking by this procedure once per second, 8 hours per day, for 100 years, with all files in error, leaves less than one chance in a thousand of ever missing an error).
  • If you fear that an adversary has intentionally altered the file in transfer, but you prepared the file yourself, then each hexadecimal digit checked makes the attack 16 times (a factor of $2^4$) as hard for the adversary. For 100-bit security (commensurate with the effort expended on Bitcoin mining so far) you want to check at least $100/4 = 25$ of the 64 digits.
    If the adversary knows that 6 digits are checked at each extremity, it takes only about $2^{6\times2\times4}=2^{48}$ hashes to find a variant of a known file that passes the test; that's significant but feasible work.
  • If you fear that an adversary has intentionally altered the file in transfer, and you do not know how the original file was made, then the best assurance is that each hexadecimal digit checked makes the attack at least 4 times (a factor of $2^2$) as hard for the adversary (the lower assurance is due to the birthday problem). For 100-bit security you want to check at least $100/2 = 50$ of the 64 digits.
    If the adversary knows that 6 digits are checked at each extremity, it takes only about $2^{6\times2\times2+1}=2^{25}$ hashes to find two files whose (different) SHA-256 hashes will both pass your test (by the method of Paul C. van Oorschot and Michael J. Wiener, Parallel Collision Search with Cryptanalytic Applications, Journal of Cryptology, 1999); that's very easy.
  • As noted in that other answer, verifying a handful of digits at positions chosen at random gives pretty good security even against powerful attackers.
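The random-position check from the last bullet can be sketched as follows (a toy illustration with names of my choosing; the key point is that the positions are drawn from a source the adversary cannot predict, and kept secret):

```python
import secrets

def random_positions(num_digits: int, digest_len: int = 64) -> list:
    """Pick secret positions to spot-check in a hex digest.
    Uses the OS CSPRNG so an adversary cannot predict them."""
    return sorted(secrets.SystemRandom().sample(range(digest_len), num_digits))

def spot_check(expected_hex: str, actual_hex: str, positions: list) -> bool:
    """Compare only the digits at the chosen positions."""
    return all(expected_hex[i] == actual_hex[i] for i in positions)
```

Since the attacker doesn't know which digits you will compare, they must match (nearly) all 64 of them to pass reliably.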
fgrieu
4

As SEJPM♦ has answered, for full security - all of the digits.

However, the "answer" to

I usually just look at the first/last 5 or 6 hex digits and call it good enough.

will depend on the threat type. If you're worried about accidental corruption of a file you have complete control over, then checking the last 6 digits leaves roughly a 1-in-16-million chance of missing a corruption. However, if the threat is a malicious swap of the file, the odds can be much worse. An attacker can replace the original file with one containing malware (in the case of an executable) and keep appending different bytes at the end until the result has the required last 6 digits (assuming you are not the only one checking just those digits). Since hash outputs behave as if random, the attacker is expected to succeed after about 16 million tries. For a short file, that can take a couple of seconds on a regular GPU; with an ASIC, even less.
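To illustrate the scale of this attack, here is a toy brute-force sketch (my own illustration, not from the answer). It targets only the last 4 hex digits (~65k tries expected) so it finishes in well under a second; the same loop against 6 digits would need ~16.7 million tries:

```python
import hashlib

def forge_suffix(base: bytes, target_suffix: str) -> bytes:
    """Append a counter to `base` until the SHA-256 hex digest ends
    with target_suffix. Expected tries: 16 ** len(target_suffix)."""
    i = 0
    while True:
        candidate = base + b"#" + str(i).encode()
        if hashlib.sha256(candidate).hexdigest().endswith(target_suffix):
            return candidate
        i += 1

forged = forge_suffix(b"innocuous payload", "beef")
```

Each extra digit checked multiplies the attacker's work by 16, which is why only the full digest gives the full security guarantee.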

But this shouldn't be a practical question. Good apps for verifying checksum hashes compare the digits for you.

ispiro
3

As many as you want.

It depends on your threat model. If your threat model is your neighbour's son messing with your Wi-Fi, maybe a handful of digits would suffice.

If you downloaded the file over TLS with a valid certificate from a trusted source, zero hash digits might be a reasonable number to verify. Alternatively, if the file and hash come from the same source over the same medium, the verification is futile.

If you fear a competent attacker messing with the message, you should verify random octets, not a prefix/suffix. Matching the first/last few is feasible depending on the attacker's computing power, but matching a high percentage of the digest is beyond anyone's ability. So verifying a handful of octets at random positions gives pretty good security even against powerful attackers. I try to remember that I'm lousy at picking random octets and naturally verify a few in a row, so a few more octets may be required for high confidence.

Meir Maor