Checking encoded strings for a hash collision in Python

Question

There is a common term used in cryptography called a hash collision. If I am reading the definition correctly on Wikipedia, this can occur if two different data values give rise to the same hash value.

Duplicate hash, different input:

text1 encoded = hash1 
text2 encoded = hash1

The first code block hashes a random binary value with the digest() function; I found it on a website. The second code block is my modified version, which, as I understand it, detects a hash collision. Notice that the second code block checks whether the hash is a duplicate while the original string is different.

Can anyone explain if my second code block is a hash collision and if not, why? And explain how the first and second code blocks differ in terms of the definition.

https://www.learnpythonwithrune.org/birthday-paradox-and-hash-function-collisions-by-example/

Code Block #1:

import hashlib
import os
collision = 0
for _ in range(1000):
    lookup_table = {}
    for _ in range(16):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result not in lookup_table:
            lookup_table[result] = random_binary
        else:
            collision += 1
            break
print("Number of collisions:", collision, "out of", 1000)

Code Block #2:

Codes 0 through 31 and 127 (decimal) are unprintable control characters. Code 32 (decimal) is a nonprinting spacing character. Codes 33 through 126 (decimal) are printable graphic characters.

string.ascii_lowercase + string.ascii_uppercase + string.ascii_letters + string.digits + string.punctuation + string.whitespace + string.printable

import hashlib
import os
import random
import string
collision = 0
total_attempts = 100000
lookup_table = {}
for _ in range(total_attempts):
    # Build a random 3-character string and hash its UTF-8 encoding
    str = ''.join(random.choice(string.printable) for i in range(3))
    str_encode = str.encode('utf-8')
    hash = hashlib.md5(str_encode).hexdigest()
    hash = hash[:3]  # keep only the first 3 hex characters of the hash

    if hash in lookup_table:
        if str not in lookup_table[hash]: # hash is the same; string is different
            collision += 1
            print(lookup_table[hash] + '_' + hash)
            lookup_table[hash] = lookup_table[hash] + ';' + str
    else:
        lookup_table[hash] = ';' + str

print("Number of collisions:", collision, "out of", total_attempts)

fgrieu · Accepted Answer · 2023-03-22T08:59:16.247

A hash collision is the circumstance in which two distinct inputs of a hash function have the same hash. A colliding pair is two such distinct inputs. For examples of MD5 collisions and colliding pairs, see there.

The first code block finds and counts events that are mostly hash collisions for MD5 restricted to its first byte, but could also be accidental collisions between the values returned by os.urandom(16) (the latter are so extremely rare that their effect on the result is negligible). The experiment is restricted to collisions detected among 16 inputs, and is repeated 1000 times. The code seems written to illustrate a variant of the usual birthday problem.
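As a rough check on what the first code block should report, the birthday-problem probability can be computed directly. This is a minimal sketch that assumes the first byte of MD5 behaves like a uniform random value among 256; the function name birthday_collision_probability is made up for illustration.

```python
def birthday_collision_probability(n_inputs, n_values):
    """Probability that at least two of n_inputs uniform draws
    from n_values possible values coincide."""
    p_no_collision = 1.0
    for i in range(n_inputs):
        # The i-th draw must avoid the i values already taken.
        p_no_collision *= (n_values - i) / n_values
    return 1.0 - p_no_collision


# 16 inputs, hash truncated to one byte (256 possible values)
p = birthday_collision_probability(16, 256)
print(f"Probability of a collision in one trial: {p:.3f}")
print(f"Expected count over 1000 trials: about {1000 * p:.0f}")
```

Under that assumption the probability comes out at roughly 0.38, so the first code block should print a count somewhere around 380 out of 1000, varying from run to run.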

The second code block (in version 7) can't find any hash collision, for two independent reasons:

It attempts to find MD5 collisions in a naive way, by hashing random strings; but (by the aforementioned birthday problem) we would likely need on the order of $2^{64}$ hashes to find one MD5 collision in this way, and we only do $10^5=2^{16.6\ldots}$. Hint: shorten the hash (not the input of the hash) to its first three bytes so that there are collisions to observe.
lookup_table never contains more than one element, because it's reset inside the loop; FIX THAT!
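The arithmetic behind the first reason can be checked with a short sketch. It uses the standard birthday approximation (about n(n-1)/2 pairs, each colliding with probability 1/2^bits) and assumes the hash output is uniformly distributed; the variable names are illustrative only.

```python
import math

attempts = 10**5
log2_attempts = math.log2(attempts)
print(f"10^5 is about 2^{log2_attempts:.2f}")

# Number of unordered pairs among the hashed inputs
pairs = attempts * (attempts - 1) // 2

# Expected number of colliding pairs for full 128-bit MD5
expected_pairs_full = pairs / 2**128
print(f"Expected colliding pairs, full MD5: {expected_pairs_full:.2e}")

# Expected number of colliding pairs when the hash is cut to 3 bytes (24 bits)
expected_pairs_truncated = pairs / 2**24
print(f"Expected colliding pairs, 3-byte hash: {expected_pairs_truncated:.0f}")
```

With the full hash the expected count is vanishingly small, while truncating to three bytes should give a few hundred colliding pairs for the same number of attempts.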

The code distinguishes hash collisions from accidental collisions in the hashed strings.

Note: one simple way to avoid collisions among the inputs of the hash is to generate the inputs incrementally. This allows the code to be simplified.
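One possible way to apply this note together with the earlier hint is sketched below. It is not the asker's code: the function name find_truncated_md5_collisions and the choice of decimal counters as inputs are assumptions made for illustration. Because every input is a distinct counter, any repeated truncated hash is a genuine collision and no check on the inputs is needed.

```python
import hashlib


def find_truncated_md5_collisions(n_inputs, n_bytes=3):
    """Return pairs of distinct inputs whose MD5 digests agree
    on their first n_bytes bytes."""
    seen = {}   # truncated digest -> first input that produced it
    found = []  # list of (earlier_input, later_input)
    for counter in range(n_inputs):
        # Incremental inputs: each message is distinct by construction
        message = str(counter).encode("ascii")
        digest = hashlib.md5(message).digest()[:n_bytes]
        if digest in seen:
            found.append((seen[digest], message))
        else:
            seen[digest] = message
    return found


colliding_pairs = find_truncated_md5_collisions(100000)
print("Number of colliding pairs:", len(colliding_pairs), "out of", 100000)
for first, second in colliding_pairs[:3]:
    print(first, second, hashlib.md5(first).digest()[:3].hex())
```

The lookup table is created once, outside the loop, so it accumulates every truncated hash seen so far.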
