What is the reason to separate domains in the internal hash algorithm of a Merkle tree hash?

Question

RFC 6962 states:

Note that the hash calculations for leaves and nodes differ. This domain separation is required to give second preimage resistance.

That means that whenever the hash is computed on a leaf, a distinct known element is prepended to the element $e$: $$H(0\mathbin\|e)$$ and whenever the hash is applied to the parent node of the leaves $h_0=H(0\mathbin\|e_0)$ and $h_1=H(0\mathbin\|e_1)$, a $1$ is put at the beginning:

$$h_2=H(1\mathbin\|h_0\mathbin\|h_1)$$

It is not clear what the security implication is if $0$ and $1$ are not prepended to the hash input to separate the two domains. The authors state that this is done to prevent second-preimage attacks. But for a Merkle tree hash we already require the hash function $H$ to be collision-resistant.

score 9 · Accepted Answer · answered Jan 31 '17 at 00:01

The document you refer to describes a method for hashing lists of data entries. Assume you do not prepend $0$ or $1$. Then, the hash for the list $(e_1, e_2)$ is $H(h_1 \| h_2)$ for $h_1 = H(e_1)$ and $h_2 = H(e_2)$. It is now easy to find a second preimage of that, namely the "list" with the single entry $h_1 \| h_2$, which will be hashed to $H(h_1 \| h_2)$.

This attack does not work if you prepend $0$s and $1$s as suggested: The hash of the list $(e_1, e_2)$ is then $H(1 \| h_1 \| h_2)$ for $h_1 = H(0 \| e_1)$ and $h_2 = H(0 \| e_2)$. You now cannot easily find a second preimage of that: The single-entry list $h_1 \| h_2$ gets hashed to $H(0 \| h_1 \| h_2)$, and the single entry $1 \| h_1 \| h_2$ gets hashed to $H(0 \| 1 \| h_1 \| h_2)$.
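To make the two cases concrete, here is a minimal sketch. It assumes SHA-256 as $H$ and the single bytes `0x00` and `0x01` as the prefixes; the helper names `root_plain` and `root_separated` are made up for illustration and only handle lists of one or two entries.

```python
import hashlib


def H(data: bytes) -> bytes:
    """SHA-256, standing in for the generic hash function H (an assumption)."""
    return hashlib.sha256(data).digest()


def root_plain(entries: list) -> bytes:
    """Hash a one- or two-entry list WITHOUT domain separation."""
    if len(entries) == 1:
        return H(entries[0])
    left, right = entries
    return H(H(left) + H(right))


def root_separated(entries: list) -> bytes:
    """Same list hash, with prefix 0x00 for leaves and 0x01 for inner nodes."""
    if len(entries) == 1:
        return H(b"\x00" + entries[0])
    left, right = entries
    return H(b"\x01" + H(b"\x00" + left) + H(b"\x00" + right))


e1, e2 = b"entry one", b"entry two"

# Without prefixes: the single entry h1 || h2 hashes to the same value
# as the two-entry list (e1, e2).
forged_entry = H(e1) + H(e2)
plain_collides = root_plain([e1, e2]) == root_plain([forged_entry])

# With prefixes: the analogous single entry is hashed under the leaf
# prefix, so it no longer matches the inner-node hash.
forged_entry_sep = H(b"\x00" + e1) + H(b"\x00" + e2)
separated_collides = (
    root_separated([e1, e2]) == root_separated([forged_entry_sep])
)

print(plain_collides, separated_collides)  # prints: True False
```

Note that the unprefixed collision needs no work on $H$ at all: it follows purely from the structure of the tree.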

score 4 · Answer 2 · answered Jan 30 '17 at 22:35

I believe that the issue is not what we normally call a second preimage attack on the hash function, but is actually a forgery attack on the system.

Suppose that the leaf hash was $H(e)$, and that the Merkle node hash was $h_2 = H(h_0 || h_1)$.

In that case, if we see a valid signature that involves a Merkle node computation $h_2 = H(h_0 || h_1)$, we can immediately generate a signature for the message $h_0 || h_1$ (as $h_2$ is the leaf hash for that message, and we can just copy the rest of the authentication path).

While $h_0 || h_1$ might not be an interesting message to forge, it is nevertheless a good idea to eliminate that possibility.
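As a rough illustration of copying the authentication path, the sketch below builds a four-leaf tree without domain separation and checks paths against its root. SHA-256, the `verify_path` helper, and the `(sibling, sibling_is_left)` path encoding are all assumptions chosen for the example; in a real signature scheme the root would additionally be signed.

```python
import hashlib


def H(data: bytes) -> bytes:
    """SHA-256, standing in for the generic hash function H (an assumption)."""
    return hashlib.sha256(data).digest()


def verify_path(message: bytes, path: list, root: bytes) -> bool:
    """Check an authentication path in a tree WITHOUT domain separation.

    `path` is a list of (sibling_hash, sibling_is_left) pairs, ordered
    from the leaf level up to the root (a hypothetical encoding).
    """
    node = H(message)  # leaf hash H(e)
    for sibling, sibling_is_left in path:
        node = H(sibling + node) if sibling_is_left else H(node + sibling)
    return node == root


# Honest four-leaf tree.
m = [b"m0", b"m1", b"m2", b"m3"]
h = [H(x) for x in m]
h01, h23 = H(h[0] + h[1]), H(h[2] + h[3])
root = H(h01 + h23)

# Legitimate path for m0: sibling h1 on the right, then h23 on the right.
honest_path = [(h[1], False), (h23, False)]

# Forgery: treat h0 || h1 as a message. Its "leaf hash" equals the inner
# node h01, so the rest of the honest path can be reused unchanged.
forged_message = h[0] + h[1]
forged_path = honest_path[1:]

honest_ok = verify_path(m[0], honest_path, root)
forged_ok = verify_path(forged_message, forged_path, root)
print(honest_ok, forged_ok)  # prints: True True
```

With leaf and node prefixes, `H(0 || h0 || h1)` would not equal the inner node `H(1 || h0 || h1)`, so the shortened path would fail to verify.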

score 2 · Answer 3 · 2017-03-23T06:59:29.910

This attack is typically easier to understand with a concrete example.

Suppose the underlying data structure being hashed was User data, with a respective firstName, middleName, lastName and age.

The resulting Merkle tree might look something like:

$$ \begin{aligned} root &= H(a || b) \\ a &= H(c || d), \quad b = H(e || f) \\ c &= H(first), \quad d = H(middle), \quad e = H(last), \quad f = H(age) \end{aligned} $$

If your verification algorithm does not bound the shape of the data (for example, the depth of the tree or the number of leaves), an attacker could omit $age$ and trick the verifier into thinking $f = H(age)$ is the User's actual age. This attack wouldn't make much sense against a firstName, as the binary data of $c = H(first)$ likely isn't human-readable.

However, if age were deserialized in such a way that any remaining bytes were discarded, an attacker could find a value that is plausible and still verifiable.

One mitigation against this is to use an alternative hash algorithm $H'$ for leaf nodes, such that:

$$H(leaf) \ne H'(leaf)$$

This can also be done by defining $H^L(x) = H(0 || x)$ and $H^I(x) = H(1 || x)$, where $H^L$ is the leaf hash and $H^I$ is the inner-node hash.
