Why does the padding in Merkle–Damgård hash functions like MD5 contain the message length?

Question

I understand the need for padding in MD5 and other hash algorithms such as SHA-1, SHA-256, SHA-384 and SHA-512. But why do we append the message length to the padding? I heard it strengthens the hash, but how?

Please provide an example if possible and how it applies to this quote:

The inclusion of the length-block effectively encodes all messages such that no encoded input is the tail end of any other encoded input.

Why is it bad for a message to be identical to the tail of another message?

Seth · Accepted Answer · 2011-12-11T04:22:39.583

MD5, like many other hash functions, uses the Merkle–Damgård construction. You take the message and break it up into fixed-size blocks. You start with an initialization vector (IV), which you feed into a compression function along with the first block. Take the output (it will be the same length as the IV), and feed it into the compression function along with the second message block, and so on.

Calling the compression function $c$ and the "raw" (unpadded) hash function $h$, we have $$ \begin{align*} h(ZABCD) &= c(h(ZABC), D) \\&= c(c(h(ZAB), C), D) \\&= c(c(c(h(ZA), B), C), D) \\&= c(c(c(c(h(Z), A), B), C), D) \\&= c(c(c(c(c(I,Z), A), B), C), D)\end{align*}$$ with a fixed initialization vector $I$ (each letter represents a block).
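The iteration can be sketched in a few lines of Python. This is only an illustration: the compression function `c` here is a stand-in built from truncated SHA-256, and the 4-byte block size and all-zero `IV` are arbitrary assumptions, not the parameters of MD5 or any real hash.

```python
import hashlib

BLOCK = 4            # toy block size in bytes (assumption)
IV = b"\x00" * 4     # toy initialization vector (assumption)

def c(state: bytes, block: bytes) -> bytes:
    """Toy compression function (hypothetical): truncated SHA-256 of state || block."""
    return hashlib.sha256(state + block).digest()[:len(IV)]

def h(message: bytes) -> bytes:
    """Raw (unpadded) hash: fold c over the blocks, starting from IV."""
    assert len(message) % BLOCK == 0, "the raw hash expects whole blocks"
    state = IV
    for i in range(0, len(message), BLOCK):
        state = c(state, message[i:i + BLOCK])
    return state

# h(ZAB) = c(h(ZA), B), with one 4-byte block per letter
assert h(b"ZZZZAAAABBBB") == c(h(b"ZZZZAAAA"), b"BBBB")
```

The output of each step has the same length as the IV, so it can be fed straight back in as the next chaining value.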

Different hash functions use different compression functions. The goal is to prove that we can find a collision in the hash function only if we can find a collision in the compression function (the next step would be to argue that finding a collision in the compression function is extremely difficult).

How would this argument work? We could restate our goal as follows:

If you give me two messages that cause a collision in the hash function, I can find two pairs of inputs for the compression function that cause a collision.

Here's how I meet this goal: I take the two messages you give me, and split them into blocks. Then I find the right-most block that the two messages don't have in common. For example, if the first message (after padding) has blocks $ZABCD$ and the second has blocks $XKCD$, then the third block from the right is different ($B$ vs. $K$).

Since $h(ZABCD) = h(XKCD)$, I check to see if $h(ZABC) = h(XKC)$. If not, I have my collision:

$$ h(ZABCD) = c(h(ZABC), D) = c(h(XKC), D) = h(XKCD). $$

On the other hand, if $h(ZABC) = h(XKC)$, this isn't really a collision for $c$, since the input values are the same. So I go back a block and see if $h(ZAB) = h(XK)$, and do the same thing. If they are not equal, I have a collision: $c(h(ZAB), C) = c(h(XK), C)$. If they are equal, I go back another block.

But now, since $B \neq K$, I don't need to worry about whether or not $h(ZA) = h(X)$. I know that $c(h(ZA), B) = c(h(X), K)$ (because $h(ZAB) = h(XK)$), and since $B$ and $K$ are distinct, this is a collision: different inputs, same output.

I can go through this process for any pair of inputs that generate a hash function collision in order to find a pair of inputs that cause a compression function collision. QED, right?

But wait!
What if, when I go from right to left, I run out of blocks before I find two that are different? Then the argument breaks. So for the argument to work, I need my padding scheme to ensure that one (padded) message is never the tail end of another.
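The right-to-left walk, including the case where it runs out of blocks, can be sketched as follows. Everything here is a toy under stated assumptions: `c` is a hypothetical compression function with a 1-byte output (so that a hash collision can be brute-forced instantly), messages are given as lists of already-padded 2-byte blocks, and the walk compares the full compression inputs (chaining value and block) at each step rather than the blocks alone.

```python
import hashlib
from itertools import count

BLOCK = 2      # toy block size in bytes (assumption)
IV = b"\x00"   # 1-byte chaining value, so collisions are cheap to find

def c(state: bytes, block: bytes) -> bytes:
    """Toy compression function (hypothetical): SHA-256 truncated to 1 byte."""
    return hashlib.sha256(state + block).digest()[:1]

def chain(blocks):
    """Return the chaining values [IV, h(first block), ..., h(all blocks)]."""
    values = [IV]
    for block in blocks:
        values.append(c(values[-1], block))
    return values

def compression_collision(m1, m2):
    """Walk two colliding block lists from the right.

    Returns two distinct (state, block) inputs with the same output under c,
    or None if the walk runs out of blocks first.
    """
    s1, s2 = chain(m1), chain(m2)
    assert s1[-1] == s2[-1], "the two messages must have the same hash"
    i, j = len(m1), len(m2)
    while i > 0 and j > 0:
        a, b = (s1[i - 1], m1[i - 1]), (s2[j - 1], m2[j - 1])
        if a != b:
            return a, b        # different inputs, same output
        i, j = i - 1, j - 1    # identical inputs: go back one block
    return None                # one message is the tail of the other (or they are equal)

def find_hash_collision():
    """Brute-force two distinct two-block messages with the same toy hash."""
    seen = {}
    for n in count():
        msg = [n.to_bytes(BLOCK, "big"), b"ok"]
        digest = chain(msg)[-1]
        if digest in seen:
            return seen[digest], msg
        seen[digest] = msg

m1, m2 = find_hash_collision()
a, b = compression_collision(m1, m2)
assert a != b and c(*a) == c(*b)
```

The `None` branch is exactly the gap in the argument: when the shorter message is a tail of the longer one, the walk reaches the IV without ever seeing two different inputs.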

fgrieu · Answer 2 · 2015-08-24T01:09:25.840

The Merkle–Damgård hash construction customarily pads the message $M$ to be hashed with a single bit set to 1, a minimal number of bits set to 0, and the representation of the length of the message in binary over some fixed number of bits. The padded message is then formed of a number of blocks $B_i$. The hash is computed by repeatedly applying a compression function $C$: $(X,B)\rightarrow C(X,B)$, giving $$H(M)=C(C(\dots C(C(IV,B_0),B_1)\dots,B_{k-2}),B_{k-1})$$

The question is: why pad with the length? Isn't that redundant with the bits added before it?

Having the length in the padding allows a simple security proof: a full hash collision between two known distinct messages allows exhibiting a collision in the underlying compression function $C$, that is, exhibiting $X,B,X',B'$ with $C(X,B)=C(X',B')$ and $(X,B)\ne (X',B')$.

The proof sketch goes:

1. If the distinct colliding messages are of different length, then a collision in $C$ occurs in the last block: we know $B_{k-1},B'_{k-1}$ are distinct since they contain the length.
2. If the distinct colliding messages are of the same length, then there is a well-defined rightmost block where the padded messages differ, and when we scan for a collision in $C$ from the right of the padded message up to that block, we exhibit a collision.
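To make the role of the length concrete, here is a small sketch comparing padding with and without the length field. It assumes an MD5-style layout (64-byte blocks, a `0x80` byte for the 1 bit, zero bytes, then the bit length as a 64-bit little-endian integer); the function names are made up for this illustration.

```python
BLOCK = 64  # bytes per block, as in MD5 and SHA-256

def pad_no_length(msg: bytes) -> bytes:
    """A 1 bit, then just enough 0 bits to fill the final block."""
    padded = msg + b"\x80"
    return padded + b"\x00" * (-len(padded) % BLOCK)

def pad_with_length(msg: bytes) -> bytes:
    """A 1 bit, 0 bits, then the message length in bits over 64 bits."""
    padded = msg + b"\x80"
    padded += b"\x00" * ((-len(padded) - 8) % BLOCK)
    return padded + (8 * len(msg)).to_bytes(8, "little")

short = b"abc"
long_ = b"\x00" * BLOCK + short   # one extra block in front of the same message

# Without the length, the padded short message is the tail of the padded long one.
assert pad_no_length(long_).endswith(pad_no_length(short))

# With the length, the last blocks differ, so neither is a tail of the other.
assert pad_with_length(long_)[-BLOCK:] != pad_with_length(short)[-BLOCK:]
```

Messages of different length always get different final blocks under `pad_with_length`, which is what case 1 of the proof sketch relies on.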

Without the length in the padding, we can still have a security proof, but with a stronger hypothesis on $C$: a full hash collision between two known distinct messages allows exhibiting either a collision in $C$ as above, or a preimage for the $IV$ constant given in the definition of the hash, that is, $X,B$ with $C(X,B)=IV$. The proof is more complex: we must subdivide case 1 depending on whether the messages have the same padded length or not; if they do not, we further subdivide depending on whether the shorter padded message matches the end of the longer one or not; if it does, we may get no collision, but rather a preimage of $IV$.
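The preimage case can be written out as follows. The notation $X_i$, $X'_i$ for chaining values and the offset $j$ are introduced here for illustration only. Suppose the padded messages are $B_0\dots B_{k-1}$ and $B'_0\dots B'_{k'-1}$ with $j=k'-k>0$ and $B'_{j+i}=B_i$ for all $0\le i<k$, and define

$$X_0=X'_0=IV,\qquad X_{i+1}=C(X_i,B_i),\qquad X'_{i+1}=C(X'_i,B'_i).$$

The hashes collide, so $X'_{k'}=X_k$. Scanning from the right, either some index $i$ gives $X'_{j+i}\ne X_i$ with $C(X'_{j+i},B_i)=C(X_i,B_i)$, which is a collision in $C$, or equality holds all the way down to $i=0$, and then

$$C(X'_{j-1},B'_{j-1})=X'_j=X_0=IV,$$

which is a preimage of $IV$.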

This latter proof is quite tight. Without length in the padding, and with the IV replaced by a constant of unknown origin, an explicit attack is possible, as follows. Consider a hash variant similar to MD5 or SHA-256, with these two differences:

- length is removed from the padding (which becomes: a single bit set to 1, then just enough bits set to 0 to fill the final 512-bit block);
- the $IV$ is some random-looking value (rather than made from increasing hex digits as in MD5, or from the fractional parts of the square roots of the first 8 primes as in SHA-256).

The compression function $C$ used in MD5 and SHA-256 is of the form $$(X,K)\rightarrow C(X,K)=X\tilde+E(X,K)$$ where $\tilde+$ is a variant of addition (with a few carries suppressed), and $X\rightarrow E(X,K)$ is a reversible block cipher with key $K$.

Whoever defined the $IV$ could have chosen a secret 512-bit $K$ and computed $IV = E^{-1}(0,K)$, which ensures $IV=C(IV,K)$, and thus allows insertion of $K$ at the beginning of any message without changing the hash.
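This backdoor can be reproduced on a toy scale. The sketch below is not MD5 or SHA-256: it assumes a made-up 32-bit Feistel cipher `E` with 8-byte keys, uses ordinary addition modulo $2^{32}$ in place of $\tilde+$, and pads without the length. All names (`E`, `E_inv`, `C`, `H`, the key `b"backdoor"`) are hypothetical.

```python
import hashlib

MASK16, MASK32 = 0xFFFF, 0xFFFFFFFF
BLOCK = 8  # toy block (and cipher key) size in bytes

def round_keys(key: bytes):
    """Derive eight 16-bit round keys from the block used as cipher key."""
    d = hashlib.sha256(key).digest()
    return [int.from_bytes(d[2 * i:2 * i + 2], "big") for i in range(8)]

def f(half: int, rk: int) -> int:
    """Arbitrary Feistel round function; it need not be invertible."""
    return ((half * 0x9E37 + rk) ^ (half >> 3)) & MASK16

def E(x: int, key: bytes) -> int:
    """Toy 32-bit Feistel cipher, a reversible map for each key."""
    l, r = x >> 16, x & MASK16
    for rk in round_keys(key):
        l, r = r, l ^ f(r, rk)
    return (l << 16) | r

def E_inv(y: int, key: bytes) -> int:
    """Inverse of E: undo the rounds in reverse order."""
    l, r = y >> 16, y & MASK16
    for rk in reversed(round_keys(key)):
        l, r = r ^ f(l, rk), l
    return (l << 16) | r

def C(x: int, key: bytes) -> int:
    """Compression function of the form X + E(X, K)."""
    return (x + E(x, key)) & MASK32

def H(msg: bytes, iv: int) -> int:
    """Iterated hash with 1-then-0s padding and no length field."""
    padded = msg + b"\x80"
    padded += b"\x00" * (-len(padded) % BLOCK)
    state = iv
    for i in range(0, len(padded), BLOCK):
        state = C(state, padded[i:i + BLOCK])
    return state

K = b"backdoor"        # the designer's secret block
IV = E_inv(0, K)       # chosen so that E(IV, K) = 0

assert C(IV, K) == IV                          # IV is a fixed point under K
assert H(K + b"hello", IV) == H(b"hello", IV)  # prepending K leaves the hash unchanged
```

With the length included in the padding, the final blocks of `K + m` and `m` would differ, so this particular trick would no longer give a collision for free.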

Summary: the length allows a simpler and tighter security proof. As far as I can tell, it is not indispensable, provided that the compression function $C$ is preimage-resistant, and the $IV$ is chosen without reference to $C$.

[This new answer has nothing to do with my earlier attempt, and is strongly inspired by Seth's earlier answer]

Jalaj · Answer 3 · 2011-12-13T18:19:36.610

There is a beautiful characterization of the collision-preserving padding rules for any Merkle–Damgård construction: the padding rule should be suffix-free. See the 2009 paper Characterizing Padding Rules of MD Hash Functions Preserving Collision Security by Mridul Nandi for more details.

Appending the length of the message, as it turns out, is the simplest form of padding that is suffix-free. The proof in the paper is self-contained and very easy to understand, so I am not putting forward a sketch of the proof.
