
I was reading CLRS, and Theorem 11.1 states:

In a hash table in which collisions are resolved by chaining, an unsuccessful search takes average-case time $\Theta( 1 + \alpha )$, under the assumption of simple uniform hashing.

I was trying to understand how to express this theorem with quantifiers ($\exists, \forall$) and how it relates to "average-case" time (precisely, what is meant by average-case time).

Is the theorem a statement that holds for all keys, or for any particular key? Or does it hold for all keys on average, and what does "on average" mean rigorously?

Usually, I like to separate the terms expectation and average because they have different meanings. I think of expectation as the expected value of something according to some underlying distribution, and of average as the mean value of a given sample. With these in mind, is the "average-case" scenario the same as the average running time over all keys? I.e., let $T_k$ denote the time to search for a key $k$. This is a random quantity because it depends on the key we search for. Define the average runtime of a search to be $Search = \frac{1}{n}\sum^n_{k=1} T_k$. Is the theorem in CLRS a statement about the random variable $Search$ I just defined (which takes all keys into account at once and hence applies to "any key", I guess)? Or what do they mean by average-case, precisely?

Charlie Parker

2 Answers


There are two prominent uses of the term "average" in algorithm analysis.

  1. Average-case as a special case of expected costs

    Here, "average case" just means "expected case w.r.t. uniform distribution". Since we usually analyse with uniform inputs in mind (everything else is hard, and there's not much reason to prefer one distribution over the other in most cases).

    Example: the average-case running time of sorting algorithms is often analyzed w.r.t. uniformly random permutations of the input.

  2. Average cost in the classic sense.

    When analyzing data structures, we can look at average costs across the contained elements for a fixed instance -- no probability distribution is involved here (though you could introduce one). That is, we may still want to consider a worst-, average- or best-case instance.

    Example: Consider BSTs. The average search cost of a given tree is the total cost of searching for every contained element (one after the other) divided by the number of contained elements. That total is essentially a classic quantity in AofA, the internal path length (the sum of the depths of all nodes); see the sketch below.
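To make sense 2 concrete, here is a minimal Python sketch (the `Node` class and function names are mine, just for illustration): it computes the internal path length of a fixed BST and the resulting average successful-search cost, counting one visited node per level.

```python
# A minimal sketch of the "classic average" from sense 2; names are illustrative.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def internal_path_length(node, depth=0):
    """Sum of the depths of all nodes (the root has depth 0)."""
    if node is None:
        return 0
    return (depth
            + internal_path_length(node.left, depth + 1)
            + internal_path_length(node.right, depth + 1))

def size(node):
    return 0 if node is None else 1 + size(node.left) + size(node.right)

def average_search_cost(root):
    """Average number of nodes visited over all successful searches:
    a node at depth d costs d + 1 visits, so the total is IPL + n."""
    n = size(root)
    return (internal_path_length(root) + n) / n

# Example: the fixed tree   2
#                          / \
#                         1   3
root = Node(2, Node(1), Node(3))
print(average_search_cost(root))  # (2 + 3) / 3 = 5/3 visits per search
```

Note that no probability distribution appears anywhere: the tree is fixed, and we average over its elements.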

Note: There are situations where average and expected do not mean the same thing. For instance, the expected (also: average-case) height of BSTs is in $O(\log n)$, but the average height is in $\Theta(\sqrt{n})$. That is because "expected" is implicitly (by tradition) meant w.r.t. uniformly random sequences of insertions, whereas "average" means the average over all BSTs of a given size. The two distributions are not the same, and apparently significantly so!

Recommendation: Whenever you use "expected" or "average-case", be very clear about which quantities are random w.r.t. which distribution.


The specific sentence you quote is indeed not clear if read in isolation -- that is, if you ignore that CLRS specify exactly what "simple uniform hashing" means on the very same page.

There are two potentially random quantities here: 1) the contents of the hash table itself, and 2) the key searched for. Simple uniform hashing is a simple way of specifying both.

  1. We abstract from sequences of insertions¹ and just assume that every one of the $n$ elements we inserted independently hashed to each of the $m$ addresses with probability $1/m$.
  2. We assume that the searched key hashes to each address with probability $1/m$.

That's how the proof works: our search hits each list with probability $1/m$ (cf. 2), and each list has the same expected length $n/m$ (via 1). Hence, the expected cost (under this specific model) of searching for a key $x$ not in the table is proportional to

$\qquad\displaystyle\begin{align*} T_u(x,n,m) &= 1 + \sum_{i=1}^m \operatorname{Pr}[h(x) = i] \cdot \mathbb{E}[\operatorname{length}(T[i])] \\ &\overset{1,2}{=} 1 +\sum_{i=1}^m \frac{1}{m} \cdot \frac{n}{m} \\ &= 1 + \frac{n}{m}. \end{align*}$

The "$+1$" is there to account for computing $h(x)$ and accessing $T[h(x)]$, the sum represents the cost for searching along the list.


¹ That is fair since the sequence of insertions does not have as much impact on the resulting structure as it does for, say, BSTs. The hash function shakes everything up. We don't want to talk about the precise interaction of sequence and hash function, so we just assume that the result of both is independently uniform -- that's something we can work with. It may not represent reality, of course!
Raphael

The average-case running time (or expected running time) of a randomized algorithm is typically defined to be the expectation of the running time with respect to the random coins used by the algorithm (i.e., the random bits it uses internally).

If this number depends on the input, we often choose the worst case over all inputs. If it matters, a good source should say what it means.

Note that this has nothing to do with amortized analysis. Amortized running time is something completely different; it is about the running time of a sequence of operations, and it applies equally well to deterministic algorithms, whereas average-case running time applies only to randomized algorithms.


Therefore, the statement you are quoting means: for every key, the running time to search for that key is a random variable whose expectation is $\Theta(1+\alpha)$. (It is a random variable because the running time depends on the randomness in the hash function.)
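To illustrate that reading with a toy Python sketch (the setup and names are my own, not CLRS's exact formalism): fix the stored keys and one key not in the table, and let the hash function be the only source of randomness, modeled here as a fresh uniformly random mapping per trial. Averaged over the random hash functions, the cost for that one fixed key comes out near $1 + \alpha$.

```python
# The running time for one *fixed* key as a random variable over the hash function.
import random

def cost_for_fixed_key(stored, key, m, rng):
    # Model h as a uniformly random function on the keys we care about.
    h = {k: rng.randrange(m) for k in stored + [key]}
    chain_len = sum(1 for k in stored if h[k] == h[key])
    return 1 + chain_len          # compute h(key), then walk its chain

rng = random.Random(0)
stored = list(range(500))         # n = 500 stored keys
key = 10**9                       # a key that is not in the table
m = 50                            # alpha = 500 / 50 = 10
trials = 20_000
avg = sum(cost_for_fixed_key(stored, key, m, rng) for _ in range(trials)) / trials
print(avg, 1 + len(stored) / m)   # empirical expectation vs. 1 + alpha = 11
```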


Caveat: You may occasionally find situations where a particular distribution on the inputs is specified, and one takes the expectation of the running time with respect to the random choice of input. However, this usage is rare, and it would probably be poor form to use "average-case" for this meaning if it's not clear from context that this is what was intended.

D.W.