Hash Output Distribution for Partly Randomised Input

Question

For a good (unkeyed) hash function, if part of the input is random, is it fair to assume that the output will also be randomised? More specifically, suppose $H$ has an $n$-bit output, and we write its input as $x||y$, where $y$ has $n$ bits and $x$ is of arbitrary length. If $y$ is sampled uniformly at random from $\{0,1\}^n$, can we argue that (for some fixed $x$) $H(x||y)$ will also be uniformly distributed over $\{0,1\}^n$ (up to some small error term)? It seems this should informally follow from properties like the avalanche effect, but I'm not sure how we can argue this without any idealised assumptions on the hash function.

score 2 · Answer 1 · answered Feb 05 '24 at 23:14

In general the process of converting from "partial randomness" to "uniform randomness" is known as randomness extraction. One can show that "hashes" are good randomness extractors, but these hashes are universal hash functions, not cryptographic hash functions. See the statement of the Leftover Hash Lemma.

This is all to say that the natural way to convert your intuitive hope into a provable result is to

1. replace your cryptographic hash function family with a 2-universal hash function family $\mathcal{F}$, where $\mathcal{F}\ni f: \mathcal{X}\to\{0,1\}^m$, and
2. replace sampling $y\gets \{0,1\}^n$ with uniformly sampling a hash $f\gets \mathcal{F}$.

If you do both of these, then the LHL says that if $m \leq H_\infty(X) - 2 \log(1/\epsilon)$, then

$$ \Delta((f, f(X)), (f,U)) \leq \epsilon, $$

where $U\gets \{0,1\}^m$ is uniform and $\Delta$ denotes statistical distance. In other words, you can replace your "unpredictable, but not random" $X$ with uniform $U$, even when an adversary gets to see which function $f$ you are hashing it with. Here, $H_\infty(X) = -\log\max_x\Pr[X = x]$ is the min-entropy of $X$, a worst-case analogue of Shannon entropy.
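To make the statement concrete, here is a minimal sketch that checks the bound exhaustively on a toy example. Everything in it is an assumption chosen for illustration: the family is the set of all $m\times k$ binary matrices over GF(2) (a standard universal family), the sizes $k=6$, $m=2$ are tiny so that every $f$ can be enumerated, and the source $X$ is uniform over an arbitrary set of 16 inputs. The helper names are hypothetical.

```python
import itertools
import math


def min_entropy(dist):
    """H_inf(X) = -log2(max_x Pr[X = x]) for a dict {value: probability}."""
    return -math.log2(max(dist.values()))


def matrix_hash(rows, x):
    """Apply the GF(2)-linear map given by `rows` (one k-bit mask per output bit) to x."""
    out = 0
    for r in rows:
        parity = bin(r & x).count("1") & 1  # inner product of row and x over GF(2)
        out = (out << 1) | parity
    return out


def average_distance_from_uniform(dist, k, m):
    """Exact Delta((f, f(X)), (f, U)), i.e. the average over all f of Delta(f(X), U)."""
    uniform = 2.0 ** -m
    total = 0.0
    count = 0
    for rows in itertools.product(range(2 ** k), repeat=m):
        out = [0.0] * (2 ** m)
        for x, p in dist.items():
            out[matrix_hash(rows, x)] += p
        total += 0.5 * sum(abs(q - uniform) for q in out)
        count += 1
    return total / count


# Toy source (an assumption for illustration): uniform over 16 of the 64 six-bit strings.
K, M = 6, 2
support = [1, 2, 3, 5, 8, 13, 21, 34, 55, 7, 11, 17, 19, 23, 29, 31]
source = {x: 1.0 / len(support) for x in support}

h_inf = min_entropy(source)            # 4.0 bits
epsilon = 2.0 ** (-(h_inf - M) / 2)    # solve M = H_inf - 2*log2(1/epsilon) for epsilon
distance = average_distance_from_uniform(source, K, M)
print(f"H_inf = {h_inf}, LHL bound = {epsilon}, measured distance = {distance:.4f}")
```

The measured distance is an average over the choice of $f$, which matches the joint distribution $(f, f(X))$ in the lemma: the guarantee holds for a random member of the family, not for any single fixed function.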

For concrete numbers, say $\epsilon = 2^{-32}$ and $H_\infty(X)\geq 128$, i.e. the most common value that $X$ takes on occurs with probability at most $2^{-128}$. Then one can extract $H_\infty(X) - 2\log(1/\epsilon) = 128 - 2\times 32 = 64$ random bits from $X$, and an adversary can distinguish these extracted bits from uniform bits with advantage at most $2^{-32}$. These numbers may not be appropriate for your application, but hopefully it is clear how the LHL can be used.
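The arithmetic above is easy to redo for other parameters. The following is a small sketch of that calculation only; the function names are hypothetical, and logarithms are taken base 2 so that entropy is measured in bits.

```python
import math


def extractable_bits(min_entropy_bits, epsilon):
    """Largest integer m with m <= H_inf - 2*log2(1/epsilon), floored at 0."""
    return max(0, math.floor(min_entropy_bits - 2 * math.log2(1 / epsilon)))


def achievable_epsilon(min_entropy_bits, output_bits):
    """Smallest epsilon the bound allows when extracting `output_bits` bits."""
    return 2.0 ** (-(min_entropy_bits - output_bits) / 2)


print(extractable_bits(128, 2.0 ** -32))  # the example from the text: 64 bits
print(achievable_epsilon(256, 128))       # 256 bits of min-entropy, 128 output bits
```

Note the cost: every bit of security in $\epsilon$ consumes two bits of min-entropy, so extracting 128 bits with $\epsilon = 2^{-64}$ already requires $H_\infty(X)\geq 256$.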

That all being said, this is mostly of interest for provable results outside of an idealized model (which you indicated you are interested in). In practice, replacing the universal hash function family with a cryptographic hash appears to be common; see, for example, this.
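For comparison, the heuristic version of the construction from the question might look like the sketch below. SHA-256 (so $n = 256$) is an assumption made here for illustration, as is the function name; the Leftover Hash Lemma does not cover this code, and treating its output as uniform relies on modelling the hash as a random oracle.

```python
import hashlib
import secrets
from typing import Optional


def hash_with_random_suffix(x: bytes, y: Optional[bytes] = None) -> bytes:
    """Return H(x || y) with H = SHA-256 and y a 32-byte (n = 256 bit) string.

    If y is not supplied, it is sampled uniformly at random.
    Heuristic only: no unconditional uniformity guarantee is claimed.
    """
    if y is None:
        y = secrets.token_bytes(32)  # uniform y from {0,1}^256
    if len(y) != 32:
        raise ValueError("y must be exactly 32 bytes")
    return hashlib.sha256(x + y).digest()


digest = hash_with_random_suffix(b"fixed prefix x")
print(digest.hex())
```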
