
I am trying to understand the assumption of Simple Uniform Hashing (SUHA), as presented e.g. in the CLRS textbook and in other courses on hashing.

The usual description given to SUHA is (cf. CLRS):

"we shall assume that any given element [i.e., key] is equally likely to hash into any of the m slots, independently of where any other element has hashed to. We call this the assumption of simple uniform hashing."

However, this is quite informal, and thus I don't fully understand it. My understanding is that the assumption is supposed to mean that

$Pr[h(k)=\ell]=1/m$ for any given key $k$ and any slot $\ell=0,\ldots,m-1$ in the table, where $m$ is the number of slots in the table. But for a given key $k$, the value $h(k)$ of the deterministic hash function is fixed, so there is no probability distribution here!

Perhaps it is meant here that $k$ is a random variable distributed uniformly over all keys?

Question: over which distribution is the probability $Pr[h(k)=\ell]=1/m$ defined in SUHA?


EDIT:

My understanding is that SUHA is an (idealized) assumption that formally means the following: $Pr[h(k)=\ell]$ is the probability that an inserted key hashes to slot $\ell$, regardless of the time of insertion (it might be the first key I insert or the last one).

In other words, the sample space $\Omega$ is

$\Omega:=($the set of all [ordered] sequences of insertions to the hash table$)$.

The event $h(k)=\ell$ is thus the set of all [ordered] sequences of insertions to the hash table in which the $(n+1)$-th insertion hashes to $\ell$, where $n+1$ is the "current" insertion that I'm analyzing and $n$ is the current number of keys in my hash table.

In SUHA we shall also assume that $Pr[h(k)=\ell]=1/m$ holds for every $n$ (the current position of insertion in the sequence); namely, we assume that the event $h(k)=\ell$ is completely independent of the rest of the sequence.

To sum up the answer: the probability $Pr[h(k)=\ell]$ is taken over all possible ordered sequences of insertions to the hash table.

Jack

2 Answers


First, the outcome of a situation being deterministic doesn't mean we will necessarily assign a probability of $1$ or $0$ to it. If I say I'm going to flip a coin, but it is actually a double-sided coin, meaning both sides are heads or both sides are tails, the probability that it comes up heads is still $0.5$ as long as you don't know which kind it is. Indeed, with respect to classical mechanics, the outcome of a normal coin flip is a deterministic function of the state of the universe, and even quantum mechanics would at best predict near certainty of the classical outcome. But you are correct that if we knew $h$ and $k$ and $\ell$, then $h(k)=\ell$ would be either a logical tautology or a logical contradiction, and thus of probability $1$ or $0$ respectively.

We could consider any of $h$, $k$, or $\ell$ to be random variables, but the most natural choice would be $k$. I'll write $K$ for the random variable of the keys and reserve $k$ for some particular key. We have $$P(h(K)=\ell)=P(K\in h^{-1}(\ell)) = \mu_K(h^{-1}(\ell))$$ where $\mu_K$ is the probability measure corresponding to $K$. The simple uniform hashing assumption (SUHA) states that $$\mu_K(h^{-1}(\ell))=\frac{1}{m}$$ for every $\ell\in\{0,\dots,m-1\}$. This is simply a constraint on the probability measure $\mu_K$. There is no assumption that $K$ is "uniformly distributed" in any (other) sense. $K$ could be wildly "non-uniform" and still satisfy SUHA. Indeed, it would have to be for some hash functions, $h$.

For example, let's say $m=2$ and $h^{-1}(0)=\{k_0\}$ and thus $h^{-1}(1)=X\setminus \{k_0\}$ where $X$ is the domain of $h$. Let's further say $|X|=101$. Then for this $h$ to satisfy SUHA, it must be the case that $P(K=k_0)=\frac{1}{2}$. On $X\setminus \{k_0\}$, it could be the case that there's a single $k_1\in X\setminus \{k_0\}$ with $P(K=k_1)=\frac{1}{2}$, so that $P(K=k)=0$ for every other $k\in X\setminus \{k_0, k_1\}$. Or maybe $P(K=k)=\frac{1}{200}$ for every $k\in X\setminus \{k_0\}$ (there are $100$ such keys, so these sum to $\frac{1}{2}$). In fact, any assignment of probabilities to the elements of $X\setminus \{k_0\}$ that sums to $\frac{1}{2}$ will suffice.
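The example above can be checked mechanically. The following is only an illustrative sketch: the helper names `slot_distribution` and `satisfies_suha`, the encoding of keys as the integers 0 to 100 (with 0 playing the role of $k_0$ and 1 the role of $k_1$), and the use of exact fractions are all assumptions made for the illustration.

```python
from fractions import Fraction


def slot_distribution(key_probs, h, m):
    """Return [P(h(K) = l) for l in range(m)] by summing key probabilities per slot."""
    slots = [Fraction(0)] * m
    for key, p in key_probs.items():
        slots[h(key)] += p
    return slots


def satisfies_suha(key_probs, h, m):
    """True if every slot receives probability exactly 1/m."""
    return all(p == Fraction(1, m) for p in slot_distribution(key_probs, h, m))


M = 2
X = range(101)  # hypothetical encoding: keys 0..100, key 0 is k_0


def h(k):
    # h^{-1}(0) = {k_0}, h^{-1}(1) = X \ {k_0}
    return 0 if k == 0 else 1


# Case 1: all remaining mass sits on a single key k_1 (encoded as 1).
dist_single = {k: Fraction(0) for k in X}
dist_single[0] = Fraction(1, 2)
dist_single[1] = Fraction(1, 2)

# Case 2: remaining mass spread evenly, 100 keys * 1/200 = 1/2.
dist_spread = {k: Fraction(1, 200) for k in X}
dist_spread[0] = Fraction(1, 2)

# Keys uniform over X: this does NOT satisfy SUHA for this particular h.
dist_uniform = {k: Fraction(1, 101) for k in X}

suha_single = satisfies_suha(dist_single, h, M)
suha_spread = satisfies_suha(dist_spread, h, M)
suha_uniform = satisfies_suha(dist_uniform, h, M)
print(suha_single, suha_spread, suha_uniform)
```

Both non-uniform key distributions pass the check, while the uniform distribution over keys fails it for this $h$, which is the point of the example: SUHA constrains the pair of hash function and key distribution, not the key distribution alone.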

Typically, we would specify a distribution for $K$ and then SUHA would constrain what $h$ could be. The point is that the hash function is "well matched" to the distribution of keys, whatever that is. There is no sense in stating that a hash function satisfies SUHA without specifying the distribution of keys, but we can say that a pair of a hash function and a distribution of keys satisfies SUHA while giving the details of neither. This is all that is needed to prove many results about hashing. We don't want to limit ourselves to only talking about "uniformly" distributed keys. A hash table will still give the same performance guarantees for "non-uniformly" distributed keys as long as the hash function is chosen to "compensate" for the "non-uniform" distribution, i.e. to satisfy SUHA.

Derek Elkins left SE

The way I think about Simple Uniform Hashing is like this:

Since CLRS assumes the universe of keys $U$ fixed, it is also natural to assume that the probability distribution of keys is also fixed by the problem at hand. In fact, this is what CLRS say on page 274 (3rd edition) in the context of open-addressing:

"Of course, a given key has a unique fixed probe sequence associated with it; what we mean here is that, considering the probability distribution on the space of keys and the operation of the hash function on the keys, each possible probe sequence is equally likely."

So let $$K: S \to U$$ be the random variable of choosing a key, with its own probability distribution (which is not assumed uniform(!): some keys might be chosen much more often than others).

Then if one composes $K$ with the hash function (which is a deterministic object) $$h: U \to \{0,1,\dots,m-1\},$$ one gets a new random variable $$h(K): S \to \{0,1,\dots,m-1\}$$ that has its own probability distribution, showing how often each hash value is used in our experiment of randomly choosing keys.

Then the hash function $h$ satisfies SUHA when $h(K)$ has the $\mathrm{Unif}\{0,1,\dots,m-1\}$ distribution.

For example, if $U = \{k_1, k_2,k_3,k_4 \}$ consists of only four keys with $$P(K=k_1) = P(K=k_2) = 1/3, \ P(K=k_3) = P(K=k_4) = 1/6, $$ the appropriate hashing function with $m=2$ slots would e.g. be $$h(k_1)=h(k_3)=0, \quad h(k_2)=h(k_4)=1,$$ because then $$P(h(K)=0)=1/2, \quad P(h(K)=1)=1/2.$$

Certainly, for some configurations of $U$, $K$ and $m$ there might be no hash function $h$ satisfying SUHA. Take e.g. $U=\{k_1, k_2, k_3 \}$ with $$P(K=k_1)=2/3, \quad P(K=k_2)=P(K=k_3)=1/6,$$ and $m = 2$. There is no way to evenly spread out $k_1,k_2,k_3$ over just two slots: if you set $h(k_1)=j$, then $$P(h(K)=j) \geq 2/3$$ no matter how $h$ is defined on the other keys.
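Both small examples above can be verified by brute force over all possible hash functions. This is a sketch for illustration only: the function name `suha_hash_functions` and the string labels for the keys are hypothetical, and exhaustive enumeration is only feasible for tiny universes like these.

```python
from fractions import Fraction
from itertools import product


def suha_hash_functions(key_probs, m):
    """Enumerate every map key -> slot and keep those where each slot has mass exactly 1/m."""
    keys = list(key_probs)
    target = Fraction(1, m)
    found = []
    for slots in product(range(m), repeat=len(keys)):
        mass = [Fraction(0)] * m
        for key, slot in zip(keys, slots):
            mass[slot] += key_probs[key]
        if all(x == target for x in mass):
            found.append(dict(zip(keys, slots)))
    return found


# Four-key example: probabilities 1/3, 1/3, 1/6, 1/6.
four_keys = {
    "k1": Fraction(1, 3),
    "k2": Fraction(1, 3),
    "k3": Fraction(1, 6),
    "k4": Fraction(1, 6),
}

# Three-key example: probabilities 2/3, 1/6, 1/6.
three_keys = {
    "k1": Fraction(2, 3),
    "k2": Fraction(1, 6),
    "k3": Fraction(1, 6),
}

good = suha_hash_functions(four_keys, 2)
bad = suha_hash_functions(three_keys, 2)

print(len(good), len(bad))
```

For the four-key distribution the search finds several valid hash functions, including the one given above with $h(k_1)=h(k_3)=0$ and $h(k_2)=h(k_4)=1$. For the three-key distribution the list comes back empty, matching the argument that the slot containing $k_1$ always has probability at least $2/3$.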

Bananeen