
It is often said that hash table lookup operates in constant time: you compute the hash value, which gives you an index for an array lookup. Yet this ignores collisions; in the worst case, every item happens to land in the same bucket and the lookup time becomes linear ($\Theta(n)$).

Are there conditions on the data that can make hash table lookup truly $O(1)$? Is that only on average, or can a hash table have $O(1)$ worst case lookup?

Note: I'm coming from a programmer's perspective here; when I store data in a hash table, it's almost always strings or some composite data structures, and the data changes during the lifetime of the hash table. So while I appreciate answers about perfect hashes, they're cute but anecdotal and not practical from my point of view.

P.S. Follow-up: For what kind of data are hash table operations O(1)?

Gilles 'SO- stop being evil'

4 Answers


There are two settings under which you can get $O(1)$ worst-case times.

  1. If your setup is static, then FKS hashing will get you worst-case $O(1)$ guarantees. But as you indicated, your setting isn't static.

  2. If you use Cuckoo hashing, then queries and deletes are $O(1)$ worst-case, but insertion is only $O(1)$ expected. Cuckoo hashing works quite well if you have an upper bound on the total number of inserts and set the table size to be roughly 25% larger (see the lookup sketch below).

There's more information here.
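To make the worst-case lookup claim concrete, here is a minimal, illustrative sketch of cuckoo hashing in Python (not from the answer; the class name, hash-mixing constant, and `max_kicks` bound are all assumptions): every key has exactly two candidate slots, so a lookup never probes more than two positions.

```python
# Illustrative sketch of cuckoo hashing: lookup probes at most two slots.
class CuckooTable:
    def __init__(self, size=1024):
        self.size = size
        self.slots = [None] * size          # single array, two hash functions

    def _h1(self, key):
        return hash(key) % self.size

    def _h2(self, key):
        # Any second hash roughly independent of _h1 works; this mixing
        # constant is just a placeholder choice.
        return (hash(key) * 0x9E3779B1 + 1) % self.size

    def lookup(self, key):
        # Worst case: exactly two probes, regardless of table contents.
        for i in (self._h1(key), self._h2(key)):
            if self.slots[i] is not None and self.slots[i][0] == key:
                return self.slots[i][1]
        return None

    def insert(self, key, value, max_kicks=32):
        # Expected O(1): evict ("kick") residents to their alternate slot.
        # Duplicate-key update and rehashing are omitted for brevity; a real
        # implementation rehashes/resizes when the kick chain is too long.
        entry = (key, value)
        i = self._h1(key)
        for _ in range(max_kicks):
            if self.slots[i] is None:
                self.slots[i] = entry
                return
            self.slots[i], entry = entry, self.slots[i]   # evict resident
            k = entry[0]
            i = self._h2(k) if i == self._h1(k) else self._h1(k)
        raise RuntimeError("kick chain too long: rehash/resize needed")
```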

Suresh

This answer summarises parts of TAoCP Vol 3, Ch 6.4.

Assume we have a set of values $V$, $n$ of which we want to store in an array $A$ of size $m$. We employ a hash function $h : V \to [0..M)$; typically, $M \ll |V|$. We call $\alpha = \frac{n}{m}$ the load factor of $A$. Here, we will assume the natural $m=M$; in practical scenarios, we have $m \ll M$, though, and have to map down to $m$ ourselves.

The first observation is that even if $h$ has uniform characteristics¹, the probability of two values having the same hash value is high; this is essentially an instance of the infamous birthday paradox. Therefore, we usually have to deal with collisions and can abandon hope of $\mathcal{O}(1)$ worst-case access time.
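For concreteness (a standard estimate, not part of the original summary): if $n$ values are hashed uniformly into $[0..M)$, then \[ \Pr[\text{at least one collision}] = 1 - \prod_{i=1}^{n-1}\left(1 - \frac{i}{M}\right) \approx 1 - e^{-n(n-1)/(2M)}, \] so already $n \approx 1.18\sqrt{M}$ values push the collision probability past $1/2$.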

What about the average case, though? Let us assume that every key from $[0..M)$ occurs with the same probability. The average number of checked entries $C_n^S$ (successful search) resp. $C_n^U$ (unsuccessful search) depends on the conflict resolution method used.

Chaining

Every array entry contains (a pointer to the head of) a linked list. This is a good idea because the expected list length is small ($\alpha = \frac{n}{m}$) even though the probability of collisions is high. In the end, we get \[ C_n^S \approx 1 + \frac{\alpha}{2} \quad \text{ and } \quad C_n^U \approx 1 + \frac{\alpha^2}{2} .\] This can be improved slightly by storing the lists (partly or completely) inside the table.
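A minimal sketch of chaining in Python, assuming Python's built-in `hash` plays the role of $h$ (a simplification; cf. footnote 1). Names are illustrative.

```python
# Sketch of a chained hash table: each slot holds a list of (key, value)
# pairs; a lookup scans one chain of expected length alpha = n/m.
class ChainedTable:
    def __init__(self, m=64):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % self.m]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        # Expected cost ~ 1 + alpha/2 probes for a successful search.
        for k, v in self.buckets[hash(key) % self.m]:
            if k == key:
                return v
        return None
```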

Linear Probing

When inserting (resp. searching for) a value $v$, check positions \[h(v), h(v)-1,\dots,0,m-1,\dots,h(v)+1\] in this order until an empty position (resp. $v$) is found. The advantage is that we work locally and without secondary data structures; however, the average number of accesses diverges as $\alpha \to 1$: \[ C_n^S \approx \frac{1}{2}\left(1 +\frac{1}{1-\alpha}\right) \quad \text{ and } \quad C_n^U \approx \frac{1}{2}\left(1 +\left(\frac{1}{1-\alpha}\right)^2\right).\] For $\alpha < 0.75$, however, performance is comparable to chaining².
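A sketch of the search procedure just described, probing downwards from $h(v)$ with wrap-around (assumed representation: `table` is a list of `(key, value)` pairs or `None`):

```python
# Linear probing search in the order h(v), h(v)-1, ..., 0, m-1, ...:
# stop when the key or an empty slot is found.
def lp_lookup(table, key, h):
    m = len(table)
    i = h(key)
    for _ in range(m):
        if table[i] is None:              # empty slot: key is absent
            return None
        if table[i][0] == key:            # found
            return table[i][1]
        i = (i - 1) % m                   # probe downwards, wrapping at 0
    return None                           # table full and key absent
```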

Double Hashing

Similar to linear probing, but the probe step size is controlled by a second hash function whose values are coprime to $M$. No formal derivation is given, but empirical observations suggest \[ C_n^S \approx \frac{1}{\alpha}\ln\left(\frac{1}{1-\alpha}\right)\quad \text{ and } \quad C_n^U \approx \frac{1}{1-\alpha} .\] This method has been adapted by Brent; his variant trades increased insertion cost for cheaper searches.
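The corresponding search, sketched with two hash functions passed in as parameters (an assumption for illustration; a prime table size $m$ makes any step $1 \le h_2(v) < m$ coprime to $m$, so the probe sequence visits every slot):

```python
# Double hashing search: the step between probes is h2(key) rather than 1.
def dh_lookup(table, key, h1, h2):
    m = len(table)                        # assume m prime and 1 <= h2(key) < m
    i, step = h1(key), h2(key)
    for _ in range(m):
        if table[i] is None:              # empty slot: key is absent
            return None
        if table[i][0] == key:            # found
            return table[i][1]
        i = (i - step) % m                # same downward convention as above
    return None
```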

Note that removing elements from and extending tables has varying degrees of difficulty for the respective methods.

Bottom line: you have to choose an implementation that adapts well to your typical use cases. Expected access time in $\mathcal{O}(1)$ is achievable, though it is not guaranteed in the worst case. Depending on the method used, keeping $\alpha$ low is essential; you have to trade off (expected) access time against space overhead. A good choice for $h$ is also central, obviously.


¹ Since $h$ may be provided by arbitrarily uninformed programmers, any assumption regarding its quality is a stretch in practice.
² Note how this coincides with the default load factor (0.75) recommended for Java's Hashtable.

Raphael

A perfect hash function will result in $\mathcal{O}(1)$ worst-case lookup.

Moreover, if the maximum number of collisions possible is $\mathcal{O}(1)$, then hash table lookup can be said to be $\mathcal{O}(1)$ in the worst case. If the expected number of collisions is $\mathcal{O}(1)$, then the hash table lookup can be said to be $\mathcal{O}(1)$ in the average case.

Raphael

A perfect hash function can be defined as an injective function from a set $S$ to a subset of the integers $\{0, 1, 2, \dots, n\}$. If a perfect hash function exists for your data and storage needs, you can easily get $O(1)$ behavior.

For instance, you can get $O(1)$ performance from a hash table for the following task: given an array $l$ of integers and a set $S$ of integers, determine whether $l$ contains $x$ for each $x \in S$. A pre-processing step builds a hash table in $O(|l|)$ time, after which each element of $S$ is checked against it in $O(|S|)$ time. Altogether, this is $O(|l| + |S|)$. A naive implementation using linear search would be $O(|l||S|)$; using binary search (after sorting $l$), you can do $O(\log(|l|)\,|S|)$. Note that this solution is $O(|l|)$ space, since the hash table must map distinct integers in $l$ to distinct bins.

EDIT: To clarify on how the hash table is generated in $O(|l|)$:

The list $l$ contains integers from a finite set $U \subset \mathbb{N}$, possibly with repeats, and $S \subseteq U$. We want to determine, for each $x \in S$, whether $x$ is in $l$. To do so, we pre-compute a hash table for the elements of $l$: a lookup table. The hash table encodes a function $h: U \rightarrow \{\text{true}, \text{false}\}$. To define $h$, initially assume $h(x) = \text{false}$ for all $x \in U$. Then linearly scan through the elements $y$ of $l$, setting $h(y) = \text{true}$. This takes $O(|l|)$ time and $O(|U|)$ space.

Notice that my original analysis assumed that $l$ contained on the order of $|U|$ distinct elements. If it contains fewer distinct elements (say, $O(1)$), the space requirement may be higher than $O(|l|)$ (although it is never more than $O(|U|)$).

EDIT2: The hash table can be stored as a simple array. The hash function can be the identity function on $U$. Notice that the identity function is trivially a perfect hash function. $h$ is the hash table and encodes a separate function. I am being sloppy/confused in some of the above, but will try to improve it soon.
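A sketch of this construction in Python, assuming $U = \{0, \dots, u-1\}$ (the Boolean array and the identity hash are exactly as described above; variable names and the usage values are illustrative):

```python
# Direct-address table: the identity function on U = {0, ..., u-1} is the
# (trivially perfect) hash, and h is stored as a simple Boolean array.
# Preprocessing is O(|l|) time and O(|U|) space.
def build_table(l, u):
    h = [False] * u          # h(x) = False for all x in U
    for y in l:              # one linear scan over l
        h[y] = True
    return h

def contains(h, x):
    return h[x]              # O(1) worst-case lookup

# Usage: which elements of S occur in l?
l = [3, 7, 7, 1, 9]
S = {1, 2, 9}
h = build_table(l, u=10)
present = {x for x in S if contains(h, x)}   # {1, 9}
```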

Patrick87