Why is the sum of hashes not a proper homomorphic hash function?

Question

Let $H:X \to \{0,1\}^b$ denote a cryptographically secure $b$-bit hash function on a set $X$. Let $H^*:\mathcal P(X) \to \{0,1\}^b$ be the function on the power set of $X$ defined by $H^*(\{x_1,\dots,x_n\}) = \sum_i H(x_i)$.

I see and understand that most variants of this construction that use modular addition are flawed. I also see that these failures do not contradict the NP-completeness of the subset sum problem.

I am, however, not finding any arguments against the security of this scheme in the non-modular case, where the sum is taken over the integers. At first glance, one would assume that any attack on the non-modular variant that improves on the $\mathcal O(2^{n/2})$ bound can be turned into an improved algorithm for the subset sum problem on random inputs.

Formally, the sum of hashes in the natural domain does not fit the definition of a hash function since the size of the output is variable. This is however easy to fix in practice by applying the hash function once more to the sum.
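To make the construction concrete, here is a minimal Python sketch of the non-modular sum of hashes and of the "hash the sum once more" fix. SHA-256 stands in for $H$ (so $b = 256$), and the names `h_star` and `h_star_fixed` are purely illustrative, not part of any library.

```python
import hashlib

def H(x: bytes) -> int:
    """Stand-in for H: SHA-256 read as a 256-bit non-negative integer."""
    return int.from_bytes(hashlib.sha256(x).digest(), "big")

def h_star(items) -> int:
    """Sum of hashes over the integers (no modular reduction).

    The result grows with the number of elements, so its size is variable.
    """
    return sum(H(x) for x in set(items))

def h_star_fixed(items) -> bytes:
    """Fixed-size variant: hash the integer sum once more."""
    total = h_star(items)
    length = max(1, (total.bit_length() + 7) // 8)
    return hashlib.sha256(total.to_bytes(length, "big")).digest()
```

The integer-valued `h_star` is additive over disjoint sets, which is the homomorphic property of interest; `h_star_fixed` restores a fixed output size but is no longer additive itself.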

This is obviously related to previous questions on this forum, such as "Is the sum of hashes a suitable hash for sets?" and "Are there any practical implementation of a homomorphic hashing or signature scheme?"

Daniel S · Answer 1 · 2024-06-11T06:46:53.940

The issue here is that even if the distribution of $H$ is uniform (which is the ideal that we aim for), the distribution of $\sum H$ is not. For $n=2$ the distribution is triangular, and as $n$ grows it tends towards $\mathcal N(n2^{b-1},O(n4^{b-1}))$ per the central limit theorem. From a security point of view, the collision entropy is then much lower than the max-entropy (Hartley entropy).
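A small exact computation illustrates the non-uniformity. The sketch below assumes an ideal, perfectly uniform $H$ and a toy width such as $b = 4$; the helper names `sum_distribution` and `collision_probability` are illustrative only.

```python
from fractions import Fraction

def sum_distribution(b, n):
    """Exact distribution of the sum of n independent uniform b-bit values."""
    size = 1 << b
    dist = [Fraction(1)]  # the empty sum is 0 with probability 1
    for _ in range(n):
        new = [Fraction(0)] * (len(dist) + size - 1)
        for s, p in enumerate(dist):
            for v in range(size):
                new[s + v] += p / size
        dist = new
    return dist

def collision_probability(dist):
    """Probability that two independent samples from dist coincide."""
    return sum(p * p for p in dist)
```

For `sum_distribution(4, 2)` the distribution is triangular over 31 possible sums, and its collision probability is $171/4096 \approx 0.042$, compared with $1/31 \approx 0.032$ for a uniform distribution over the same 31 values.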

Consider for example the case $n=1200$, where the sum is approximately normally distributed with mean $1200\cdot 2^{b-1}$ and standard deviation $20\cdot 2^{b-1}$ (each uniform $b$-bit term has variance $\approx 4^b/12$, so the sum has variance $\approx 400\cdot 4^{b-1}$). Even naively, we may generate random sets of 1200 elements and evaluate our sum-of-hashes function on them. We discard sets whose sum of hashes lies outside $[1160\cdot 2^{b-1},1240\cdot 2^{b-1}]$, which is a $\approx 2\sigma$ interval and allows us to keep $\approx 95\%$ of our sets. This interval contains $80\cdot 2^{b-1}$ hash values, and even with a naive bound we would expect a collision after $\sqrt{40\cdot 2^{b-1}}/0.95\approx 6.65\times 2^{(b-1)/2}$ trials, which is well below the desired bound of $\sqrt{\pi\cdot 1200\cdot 2^{b-1}}\approx 61.4\times2^{(b-1)/2}$.

Philippe · Answer 2 · 2024-06-14T11:51:29.800

I believe this shows that the scheme is a lot less secure than the original hash function $H$, with a security parameter only proportional to the square root of the output size of $H$, in the ideal case. It is a relatively straightforward adaptation of Wagner2002. It shows that it is not, as I was hoping, the use of modular groups that makes the simple additive methods ineffective. Non-modular addition is just as vulnerable, it seems.

Let $\mathcal{Y}$ be the set of triplets of the form $(\mathcal{h},\mathcal{A},\mathcal{B})$ where:

$\mathcal{A}$, $\mathcal{B}$ are subsets of $X$
$H^*(\mathcal{A}) \ge H^*(\mathcal{B})$
$\mathcal{h} = H^*(\mathcal{A}) - H^*(\mathcal{B})$, a non-negative integer

We define a binary combining operator $\oplus \colon \mathcal{Y} \times \mathcal{Y} \to \mathcal{Y}$. Given two elements of $\mathcal{Y}$, say $y_r = (\mathcal{h}_r,\mathcal{A}_r,\mathcal{B}_r)$ and $y_s = (\mathcal{h}_s,\mathcal{A}_s,\mathcal{B}_s)$, with $\mathcal{A}_r$, $\mathcal{B}_r$, $\mathcal{A}_s$ and $\mathcal{B}_s$ pairwise disjoint:

if $\mathcal{h}_r < \mathcal{h}_s$, then $y_r \oplus y_s \to y_s \oplus y_r$
otherwise $y_r \oplus y_s \to (\mathcal{h}_r - \mathcal{h}_s, \mathcal{A}_r \cup \mathcal{B}_s, \mathcal{A}_s \cup \mathcal{B}_r)$

The idea is that the elements of $\mathcal{Y}$ are collision candidates for $H^*$, and the $\oplus$ operator combines them to form a new candidate.
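The combining step can be sketched in a few lines of Python. The `Candidate` type, the helper names, and the toy hash values in `TOY_H` are made up for illustration; they are not outputs of any real hash function.

```python
from typing import FrozenSet, NamedTuple

class Candidate(NamedTuple):
    h: int               # H*(A) - H*(B), always non-negative
    A: FrozenSet[str]
    B: FrozenSet[str]

def combine(r: Candidate, s: Candidate) -> Candidate:
    """The operator described above; assumes all four sets are pairwise disjoint."""
    if r.h < s.h:
        r, s = s, r      # swap so that the difference stays non-negative
    return Candidate(r.h - s.h, r.A | s.B, s.A | r.B)

# Hypothetical hash values, chosen by hand for the example.
TOY_H = {"a": 13, "b": 9, "c": 6}

def h_star(xs) -> int:
    return sum(TOY_H[x] for x in xs)

def singleton(x: str) -> Candidate:
    return Candidate(TOY_H[x], frozenset([x]), frozenset())

y = combine(singleton("b"), singleton("a"))  # difference 13 - 9 = 4
z = combine(y, singleton("c"))               # difference 6 - 4 = 2
```

Each combination preserves the invariant `h_star(A) - h_star(B) == h`, so a candidate whose `h` reaches 0 is a genuine collision of $H^*$.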

We create a stream $S^0$ of elements from $\mathcal{Y}$ by enumerating elements of $X$ in some order, and setting the $i$-th element of $S^0$ to $S^0_i = (H(x_i), \{x_i\}, \emptyset)$.

We build a new stream $S^1$ by consuming elements of $S^0$ and storing them in a hash table indexed by the $l$ low-order bits of the hash difference. When a collision is found, say $S^0_i = (\mathcal{h}^0_i,\mathcal{A}^0_i,\mathcal{B}^0_i)$ and $S^0_j = (\mathcal{h}^0_j,\mathcal{A}^0_j,\mathcal{B}^0_j)$, where $\mathcal{h}^0_i$ and $\mathcal{h}^0_j$ coincide in their $l$ low-order bits, we output $S^0_i \oplus S^0_j$ and erase $S^0_i$ and $S^0_j$ from the hash table.

We build new streams, $S^k$, by consuming elements of $S^{k-1}$ and storing them in a new hash table indexed by the $k\cdot l$ low-order bits of the hash difference. When a collision is found, say the hash differences of $S^{k-1}_i$ and $S^{k-1}_j$ coincide in their $k\cdot l$ low-order bits, we output $S^{k-1}_i \oplus S^{k-1}_j$ and erase $S^{k-1}_i$ and $S^{k-1}_j$ from the hash table.

Note that if $k > 0$, $S^k$ is a stream of collision candidates, made of triples $(\mathcal{h},\mathcal{A},\mathcal{B})$ where $\mathcal{A}$ and $\mathcal{B}$ each have cardinality $2^{k-1}$.

If $k\cdot l \ge b + k$, $S^k$ is a stream of actual collisions.
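The whole pipeline can be sketched with Python generators. This is only a toy illustration under stated assumptions: $H$ is SHA-256 truncated to `B = 16` bits, the elements of $X$ are the integers 0, 1, 2, ..., and the function names are invented for the example.

```python
import hashlib
from itertools import count

B = 16  # toy hash width in bits (assumption, far too small for real use)

def H(x, b=B):
    """Stand-in for H: the top b bits of SHA-256 of the decimal string of x."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - b)

def combine(r, s):
    """Combine two candidates (h, A, B), keeping the difference non-negative."""
    if r[0] < s[0]:
        r, s = s, r
    return (r[0] - s[0], r[1] | s[2], s[1] | r[2])

def level0(b):
    """Stream S^0: one candidate (H(x), {x}, {}) per element x."""
    for x in count():
        yield (H(x, b), frozenset([x]), frozenset())

def next_level(stream, bits):
    """Pair up candidates whose differences agree on the `bits` low-order bits."""
    table = {}
    mask = (1 << bits) - 1
    for cand in stream:
        key = cand[0] & mask
        other = table.pop(key, None)   # erase the stored partner, if any
        if other is None:
            table[key] = cand
        else:
            yield combine(other, cand)

def collisions(b=B, l=4):
    """Stack levels until k*l >= b + k; the last stream yields collisions."""
    stream, k = level0(b), 0
    while k * l < b + k:
        k += 1
        stream = next_level(stream, k * l)
    return stream
```

With the defaults, `next(collisions())` returns a triple `(0, A, B)` where `A` and `B` are disjoint sets of 32 integers with equal sums of truncated hashes. Each element of $X$ is consumed exactly once, which keeps the sets being merged disjoint.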

To get the first real collision, you need to fill in all $k$ hash tables. This will consume on average about $2^{k+l}$ elements of $S^0$. We therefore choose $l = \lceil \sqrt{b} \rceil$ and $k = \lceil (b + k) / l \rceil$, that is, the smallest $k$ satisfying $k\cdot l \ge b + k$. The running time is $\mathcal O(2^{2\sqrt{b}})$ invocations of $H$.
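As a quick sanity check of these parameters, the following sketch computes $l$ and $k$ for a given $b$. It reads the self-referential definition of $k$ as "the smallest $k$ with $k\cdot l \ge b + k$", which is an interpretation rather than something stated explicitly; the function name `parameters` is illustrative.

```python
from math import isqrt

def parameters(b):
    """Return (l, k) with l = ceil(sqrt(b)) and the smallest k with k*l >= b + k."""
    if b < 2:
        raise ValueError("need b >= 2 so that l >= 2 and the search terminates")
    l = isqrt(b - 1) + 1   # equals ceil(sqrt(b)) for every integer b >= 1
    k = 1
    while k * l < b + k:
        k += 1
    return l, k

l, k = parameters(256)
work_exponent = k + l      # about 2**(k + l) elements of S^0 for the first collision
```

For $b = 256$ this gives $l = 16$ and $k = 18$, so roughly $2^{34}$ evaluations of $H$, in line with the $\mathcal O(2^{2\sqrt{b}})$ estimate.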

Thereafter, to get the next collision, you need to consume an average of $2^k$ elements of $S^0$. The running time is only $\mathcal O(2^{\sqrt{b}})$, just enough time to calculate $H^*$ on the next colliding pair.

The storage requirement is $\mathcal O(\sqrt{b}\, 2^{\sqrt{b}})$.

[ I expect this is not very intelligible. Am willing to update and clarify. ]
