7

Let's suppose we are hashing a string the user typed on a keyboard. A string can be represented in many different text encodings (ANSI, UTF-8, UTF-16 and many others). Depending on which encoding is used, the underlying binary data, and therefore the hash, can be completely different, even though it looks like the same text was entered and the same algorithm was used.

However, I noticed that no matter how many hashing tools I try, the hashes are always the same. Hashing apple with SHA-256 gives the same output everywhere.

Is there a consensus on which text encoding to use when hashing string input? If there is, how widespread is that consensus? And what is the rationale behind the choice?

Mike Edward Moras
  • 18,161
  • 12
  • 87
  • 240
Sutarbi Dudek
  • 101
  • 1
  • 6

3 Answers

6

Hashing is done on binary data, not text strings. It is the job of the character encoding to determine what binary representation a string will have. Once you have represented the text using a particular encoding, you can hash it. The hashing process itself does not care about the encoding at all, only about the resulting binary data.

There is no consensus as to what the binary representation of a given string will be in every circumstance (i.e. how apple will look in binary), but there is a consensus as to what the hash itself takes as input, which is binary data. Most strings use the same encoding (ASCII, UTF-8, etc.), and since an encoding is needed to convert a string into a binary representation, the most common encodings end up being used when hashing that string. This is not universal, of course. If you enter a string into a hash function on a Japanese system, you may get a different result because it uses Shift JIS or UTF-16, for example. Simply put, there is a consensus on which text encoding to use when encoding text, but that consensus does not specify how the string will be used afterwards (e.g. passed to a hash).

Note also that many text encodings overlap for many characters. ASCII characters are represented by exactly the same bytes when they are encoded as UTF-8 instead. The biggest difference is that UTF-8 can encode characters that ASCII cannot. Because virtually everyone uses either ASCII or UTF-8, the string apple will have the same binary representation, and therefore the same hash, either way.
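To make this concrete, here is a minimal Python sketch (the language and the script itself are my illustration, not part of the answer) showing that the hash function only ever sees bytes, and that ASCII and UTF-8 happen to produce the same bytes for apple while UTF-16 does not:

```python
import hashlib

text = "apple"

# The hash only ever sees bytes; the chosen encoding decides what those bytes are.
for encoding in ("ascii", "utf-8", "utf-16-le"):
    data = text.encode(encoding)               # text -> bytes under this encoding
    digest = hashlib.sha256(data).hexdigest()
    print(f"{encoding:>9}: bytes={data.hex()}  sha256={digest}")
```

The ASCII and UTF-8 lines print identical digests because the bytes are identical; the UTF-16-LE line differs because every character gains an extra zero byte.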

forest
  • 15,626
  • 2
  • 49
  • 103
4

The consensus is: a hash expects binary bits as input (practically, most implementations handle that input as binary bytes, aka 8-bit unsigned chars in the range 0x00-0xFF) and it will generally produce binary bits as output (which most implementations likewise emit as a series of 8-bit unsigned chars).

Now, since a hash generally does not handle any text encoding itself, this means you’ll practically have to convert things like UTF-16 (a multi-byte representation) and UTF-8 (which also partly contains multi-byte characters) to “raw binary bytes” accordingly, so that you can feed that binary representation to the hashing function. Not doing this can and will result in glitches in your implementation, as one can frequently observe in online tools which try to offer, for example, Javascript SHA-256 and similar hashing, but fail to produce the correct hash values due to wrong or missing conversion of the text encoding. You have to feed your hash a series of binary bits; practically, the majority of implementations handle those input and output bits as 8-bit binary bytes (aka 8-bit unsigned chars, aka UINT8), as that is what most modern-day devices work with natively/internally.
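As an illustration of the kind of mismatch described above (a Python sketch rather than Javascript, and the example string is my own choice), a character outside ASCII hashes differently depending on how it was converted to bytes:

```python
import hashlib

text = "äpple"  # contains a non-ASCII character

utf8_bytes = text.encode("utf-8")       # 0xC3 0xA4 'p' 'p' 'l' 'e'
utf16_bytes = text.encode("utf-16-le")  # 0xE4 0x00 'p' 0x00 'p' 0x00 ...

print(hashlib.sha256(utf8_bytes).hexdigest())
print(hashlib.sha256(utf16_bytes).hexdigest())
# The two digests differ: the hash simply saw two different byte strings.
```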

Same goes for the output. Any hex representation of a hash is the result of the hash output being converted from a sequence of bits (mostly implemented as a series of 8-bit unsigned chars) to its equivalent hex representation.
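For instance (again a small Python sketch of my own), the raw digest is a sequence of bytes, and the familiar hex string is just one rendering of those bytes:

```python
import hashlib

raw = hashlib.sha256(b"apple").digest()   # 32 raw bytes
hex_form = raw.hex()                      # 64 hex characters
print(len(raw), len(hex_form))            # 32 64
print(hex_form)
```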


As for the programming part of your question: hashes do not care what you hash... at all. How you process data before or after hashing is a purely programming-related question and outside the scope of this site.

As for which text encoding to choose: that's entirely up to you and strictly depends on your individual scenario. From a programmatic point of view I would point at UTF-8, but one might equally argue that UTF-16 is the way to go, as it is more compact for (e.g.) many Asian languages, and it is what languages like Javascript and Java use internally for strings. In the end, it depends on your project and how you want to handle your data... but that decision is, as I noted, not a cryptographic one.
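One way to keep that decision visible in code is to make the encoding an explicit parameter instead of an implicit default. The helper below is a hypothetical Python sketch of mine (the name hash_text and the default are assumptions, not anything from the answer):

```python
import hashlib

def hash_text(text: str, encoding: str = "utf-8") -> str:
    """Hash a string by first converting it to bytes with an explicitly chosen encoding."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()

print(hash_text("apple"))               # UTF-8, the default chosen here
print(hash_text("apple", "utf-16-le"))  # a different digest, same text
```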


TL;DR: Formally, almost all cryptographic hashes operate on sequences of binary bits. This is why hash functions like MD5, SHA-1, the SHA-2 family, and SHA-3 all take binary input, work on binary data internally, and produce binary output. How you handle your (text) data before or after hashing it is up to you and your programming goals or the individual standards you follow (web standards mostly point to UTF-8, for example); how you handle your strings in your program is a programming decision, not a cryptographic one.

Mike Edward Moras
  • 18,161
  • 12
  • 87
  • 240
2

No, there is no consensus. It would be counterproductive anyway: hash algorithms like SHA-* are supposed to work on any binary data, period.

If apple always gives the same hash, then you didn't actually use different text encodings like UTF-16, as simple as that.
The countless more or less bad websites that offer to hash something for you usually give you no way to select an encoding for the text or to submit raw binary data, and most of them will use ASCII/ISO-8859-1/UTF-8 (apple is the same in all of them). Use something that actually hashes raw data, not some hobby website.
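For example, hashing a local file's raw bytes sidesteps the encoding guesswork entirely; this is a minimal Python sketch (any local tool that reads the file in binary mode would do the same job):

```python
import hashlib
import sys

# Hash the raw bytes of a file; no text decoding is involved.
with open(sys.argv[1], "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
```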

Mike Edward Moras
  • 18,161
  • 12
  • 87
  • 240
deviantfan
  • 1,187
  • 8
  • 16