Finding encoding-based collisions

Question

For example, if I enter "hello world" in English as a sample, the hashing protocol produces the hash by first encoding the text as a binary string.
What if I find a word in another language, such as Hindi or Chinese, that gives the same result as the English word? Do we then get the same hash?

How feasible is it to find such an encoding-based collision?

SEJPM · Accepted Answer · 2018-09-08T15:28:30.763

How does the hashing protocol generate a hash of it?

Hash functions take strings of bits. They don't care what meaning these strings have to the user, so it makes no difference to them whether the bits could be interpreted as a Hindi word or an English sentence.

This also means that you need some mapping (called an "encoding") from the text to binary strings.
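As a minimal sketch of that point, assuming Python's standard `hashlib` module and SHA-256 as the hash function (neither is prescribed by the question), the hash function refuses text and only accepts the bytes produced by an explicitly chosen encoding:

```python
import hashlib

text = "hello world"

# hashlib operates on bytes, not on text: passing a str raises TypeError.
try:
    hashlib.sha256(text)
    accepted_str = True
except TypeError:
    accepted_str = False

# Choose an encoding explicitly, then hash the resulting byte string.
data = text.encode("utf-8")
digest = hashlib.sha256(data).hexdigest()

print(accepted_str)  # False
print(data)
print(digest)
```

The hash only ever sees `data`; which language or script the bytes came from is invisible to it.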

ASCII is the classic one, but it only defines 128 characters (7 bits each, 95 of them printable), so it fails on e.g. German, French or Hindi texts.
UTF-8 is the modern standard. It uses 1 byte for ASCII characters and 2 to 4 bytes for all other Unicode code points.
UTF-16 is similar to UTF-8 but uses 2 bytes for essentially all practically relevant characters / code points and 4 bytes for the rest.
UCS-2 is like UTF-16 but instead of sometimes being 2 and sometimes 4 bytes, it drops support for the higher code points and is always 2 bytes.
UTF-32 / UCS-4 always uses 4 bytes and can represent any Unicode character.
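The differences between these encodings can be sketched as follows. This assumes Python's built-in codecs and SHA-256; the helper `sha256_hex` and the sample strings are only illustrative. The same text gives different byte strings, and therefore different hashes, under different encodings:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

english = "hello world"
hindi = "नमस्ते"

# The explicit "-le" variants are used so that no byte order mark is prepended.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

for text in (english, hindi):
    for enc in encodings:
        data = text.encode(enc)
        print(f"{text!r} {enc}: {len(data)} bytes, sha256={sha256_hex(data)[:16]}...")

# For pure ASCII text, ASCII and UTF-8 produce identical bytes (and hashes).
same_bytes = english.encode("ascii") == english.encode("utf-8")

# ASCII has no way to represent the Hindi text at all.
try:
    hindi.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_bytes)  # True
print(ascii_ok)    # False
```

Two systems therefore only agree on the hash of a text if they agree on the encoding first.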

You may want to note that Unicode is a standard that essentially assigns numbers (="code points") to character symbols. It doesn't by itself dictate how these numbers are then encoded into binary.

You may also want to note that different systems (operating systems, programming languages, database systems and other text-processing systems) may have different default encodings.

So is it possible that we get the same hash result from different texts?

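Assuming your hash function isn't completely broken (as MD5 is, for example), you won't in practice find two different byte sequences with the same hash. So this is only possible if the two texts end up as the same byte sequence, which can happen if you use different encodings (e.g. those pre-dating Unicode) for the different languages. If everything is encoded with a single Unicode encoding such as UTF-8, different sequences of code points always give different bytes.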
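A small sketch of such an encoding-based collision, assuming Python's built-in `latin-1` and `cp1251` codecs as the two legacy encodings (the choice of encodings and of the sample character is purely illustrative):

```python
import hashlib

# One character, encoded with a Western European legacy encoding.
text_a = "é"
raw = text_a.encode("latin-1")  # a single byte, 0xE9

# The very same byte, interpreted under a Cyrillic legacy encoding,
# is a different character.
text_b = raw.decode("cp1251")

# Each text hashed under "its own" legacy encoding: identical bytes, identical hash.
hash_a = hashlib.sha256(text_a.encode("latin-1")).hexdigest()
hash_b = hashlib.sha256(text_b.encode("cp1251")).hexdigest()

# Both texts hashed under one common encoding: different bytes, different hash.
utf8_a = hashlib.sha256(text_a.encode("utf-8")).hexdigest()
utf8_b = hashlib.sha256(text_b.encode("utf-8")).hexdigest()

print(text_a != text_b)  # True
print(hash_a == hash_b)  # True
print(utf8_a == utf8_b)  # False
```

This is not a weakness of the hash function: the inputs it received were the same bytes. Fixing one encoding for all texts removes the ambiguity.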
