5

I'm looking for a function to compute a hash of a set.

It needs to satisfy two properties:

  1. If someone published a hash $h(S)$ of a set $S$ and a hash $h(S')$ of some subset $S' \subset S$ of it, I should not be able to compute the hash of $S \setminus S'$ from $h(S)$ and $h(S')$.

  2. From the hashes $h(S_1)$ and $h(S_2)$ of two disjoint sets $S_1$ and $S_2$, I should be able to compute the hash $h(S)$ of their union $S = S_1 \cup S_2$.

Does such a hash function exist?

Ilmari Karonen
Ishamael

3 Answers

2

Here is an idea which mostly addresses what you appear to want. I'm not happy with it; however, I thought I'd share it just in case.

Our 'hashes' are vectors of $n$ elements, each element being either a number between 0 and $p-1$, or the special symbol $\bot$.

The hash of a singleton $\{s\}$ (that is, a set that consists of a single element $s$) is computed by selecting the $n$ elements to be numbers deterministically (e.g. using the bits from SHAKE(s)), then selecting $b$ of the $n$ indices (again, deterministically) and setting each of those $b$ positions to $\bot$.

And, to combine two hashes (using Ilmari's notation, to compute $c = a \otimes b$), you perform the following operation element-wise on the two hashes:

$c_i = \bot$ if $a_i = \bot$ or $b_i = \bot$

Otherwise, $c_i = a_i + b_i \bmod p$

It should be obvious that the above operator is both associative and commutative.

And, assuming $p, n, b$ are properly set, then given $a \otimes b$ and $a$, it should be highly probable that $b$ cannot be recovered: there is likely to be some index $i$ for which $a_i = \bot$ and $b_i \ne \bot$; in that case, you have no information on $b_i$, and so all of the $p$ possible values are equiprobable.

Of course, the hashes are quite large, and this noninvertibility is only probable (and even that is true only if the sets hashed into $a, b$ are not too large).
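The construction above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the parameter values (`n = 64`, `b = 8`, `p = 65521`), the use of SHAKE-256, the 4-byte draws, and the partial Fisher-Yates selection of the $\bot$ positions are not specified in the answer, and the names `singleton_hash` and `combine` are hypothetical.

```python
import hashlib

BOT = None  # stands in for the special symbol ⊥


def singleton_hash(s: bytes, n: int = 64, b: int = 8, p: int = 65521) -> list:
    """Hash of the one-element set {s}: n values mod p, b of them set to BOT.

    Assumption: all randomness is drawn from SHAKE-256(s), 4 bytes per draw.
    The modular reductions carry a slight bias, which a sketch can ignore.
    """
    stream = hashlib.shake_256(s).digest(4 * (n + b))
    words = [int.from_bytes(stream[4 * i:4 * i + 4], "big") for i in range(n + b)]
    vec = [w % p for w in words[:n]]
    # Pick b distinct positions with a partial Fisher-Yates shuffle.
    idx = list(range(n))
    for j in range(b):
        k = j + words[n + j] % (n - j)
        idx[j], idx[k] = idx[k], idx[j]
        vec[idx[j]] = BOT
    return vec


def combine(x: list, y: list, p: int = 65521) -> list:
    """Element-wise operator: BOT absorbs, otherwise add modulo p."""
    return [BOT if u is BOT or v is BOT else (u + v) % p for u, v in zip(x, y)]


h1 = singleton_hash(b"apple")
h2 = singleton_hash(b"banana")
h12 = combine(h1, h2)  # hash of the two-element set
```

Every position where `h1` is `BOT` stays `BOT` in `h12`, so whatever value `h2` held there is no longer visible in the combined hash.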

poncho
2

This is not really an answer but a collection of observations.

Condition 1 seems a bit problematic and confusing. It seems that it could only hold if $S$ is large enough, independently of the $H(\cdot)$ used. The reason is that if $N$, the size of $S$, is not large enough, then an attacker against condition 1 could try the following (assuming $S$ itself is also known):

  • Let $P = 2^S$ be the powerset of $S$, i.e. the set of all subsets of $S$.
  • Inputs: $S, H(S), H(S')$.
  • Let $A$ be an adversary for condition 1.
  • For every element $s \in P$, $A$ computes the hash $H(s)$ and checks whether $H(s) = H(S')$.
  • If the check passes, $A$ outputs $H(S\setminus s)$.

Now if the number of collisions is really small and the hash computation is assumed to be free (efficient enough), then the complexity would be at most $O(2^N)$, with $N$ the size of $S$. Note: if we don't know $S$, we would also need the number of maximal sets (sets that are not subsets of other sets) to be large enough.
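A minimal sketch of this exhaustive search, for illustration only. The function `H` here is a toy stand-in (SHA-256 over the sorted elements) that I am assuming purely so the example runs; it does not satisfy condition 2 and is not the hash the question asks for. The name `brute_force_complement` is likewise hypothetical. Note that this version never needs $H(S)$, only $S$ and $H(S')$.

```python
import hashlib
from itertools import combinations


def H(subset) -> str:
    """Toy stand-in set hash: SHA-256 over the sorted elements (an assumption)."""
    data = "\x00".join(sorted(subset)).encode()
    return hashlib.sha256(data).hexdigest()


def brute_force_complement(S, h_sub):
    """Given S and h_sub = H(S'), return H(S minus S'), or None if nothing matches.

    Walks the powerset of S by increasing subset size, so the worst case
    is 2^N hash evaluations for N = len(S).
    """
    elems = sorted(S)
    for size in range(len(elems) + 1):
        for cand in combinations(elems, size):
            if H(cand) == h_sub:
                return H(set(elems) - set(cand))
    return None


S = {"a", "b", "c", "d"}
S_sub = {"b", "d"}
recovered = brute_force_complement(S, H(S_sub))  # equals H({"a", "c"})
```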

Furthermore, we can actually get tighter bounds on the complexity assuming $S$ is known. If $S'$ has size $k$, then $S \setminus S'$ has size $k' = N-k$, so the total number of steps would be $\sum_{i=1}^{k'} \binom{N}{i}$. A closed formula would be really welcome! ;)

Finally, what confuses me is that subsets of $S$ should also be in the domain of any such $H(\cdot)$; therefore this property would not hold for a small enough subset, by the reasoning above.

Is it possible that such an $H(\cdot)$ satisfying condition 1 in general can never exist, or am I missing something?

Marc Ilunga
-1

Let's formalize your encoding of the sets $S$, assumed to be subsets of some universal set $$\Omega=\{\omega_1,\ldots,\omega_n\}\leftrightarrow (1,\ldots,1),$$ as the string $$E(S)=(\mathbb{1}\{ \omega_k \in S\}: 1\leq k \leq n),$$ which has a $1$ in the $k^{th}$ position if and only if $\omega_k$ is an element of $S.$ So to hash a set you hash its encoding $E(S).$

Note that any good hash function should have the avalanche property; i.e., if a bit is flipped (say an element is added to or removed from $S$ to obtain $S'$), the two hashes $h(E(S))$ and $h(E(S'))$ should not be easy to determine from each other. The property you want should hold provided $n$ is large enough that a collision can't be brute-forced, say $n>2^{512}.$

If the universe is not so large, you may need to use some kind of salting to increase the strength of the hash function.
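As a rough sketch of the encoding and hashing steps, assuming SHA-256 as the hash $h$ and a salt simply prepended to the encoding (neither choice is fixed by the answer; `encode` and `set_hash` are hypothetical names):

```python
import hashlib


def encode(S, universe) -> str:
    """E(S): one character per element of the ordered universe, '1' if it is in S."""
    return "".join("1" if w in S else "0" for w in universe)


def set_hash(S, universe, salt: bytes = b"") -> str:
    """h(E(S)) with an optional salt prefix (the salting scheme is an assumption)."""
    return hashlib.sha256(salt + encode(S, universe).encode()).hexdigest()


universe = ["w1", "w2", "w3", "w4"]
code = encode({"w1", "w3"}, universe)  # indicator string of {w1, w3}
digest = set_hash({"w1", "w3"}, universe, salt=b"example-salt")
```

The universe must be kept in a fixed order, since the position of each bit is what identifies the element.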

kodlu