
I am reading about the birthday attack in Wikipedia:

We consider the following experiment. From a set of $H$ values we choose $n$ values uniformly at random thereby allowing repetitions. Let $p(n; H)$ be the probability that during this experiment at least one value is chosen more than once. This probability can be approximated as $$p(n; H) \approx 1 - e^{-{n(n-1) \over 2H}}$$

My question: Where does that $e^{-{n(n-1) \over 2H}}$ value come from?

Maarten Bodewes
juaninf

3 Answers


Let's first calculate the chance that every value is unique.

The chance that two picked values are distinct is ${H - 1 \over H}$, because when picking the second value only $H - 1$ of the $H$ possible picks differ from the first. A third pick has a chance of ${H - 2 \over H}$ of differing from the first two, so the total chance of picking 3 distinct values is ${H - 1 \over H} \times {H - 2 \over H}$. Continuing in the same way, the chance of picking $n$ distinct values is:

$${H - 1 \over H} \times {H - 2 \over H} \times \cdots \times {H - (n - 2) \over H} \times {H - (n - 1) \over H}$$

This is exactly the same product that appears in the birthday paradox, where $H = 365$.

Then a bit more math is required, in the form of an approximation. For $|x| \ll 1$ the Taylor series of $e^x$, truncated after the linear term, gives:

$$e^x \approx 1 + x$$

Then note the following:

$${H - 1 \over H} = 1 - {1 \over H} \approx e^{-{1 \over H}}$$

Then we rewrite the calculation as follows, replacing each factor ${H - k \over H}$ by $e^{-{k \over H}}$:

$$e^{-{1 \over H}} \times e^{-{2 \over H}} \times \cdots \times e^{-{n - 2 \over H}}\times e^{-{n - 1 \over H}} =$$ $$e^{-{1 + 2 + \cdots + (n-2) + (n-1) \over H}} =$$ $$e^{-{n(n-1)/2 \over H}} = e^{-{n(n-1) \over 2H}} $$

That is the probability that every value picked is unique. If not every value is unique there must be a collision, therefore the chance of a collision is:

$$p(n; H) \approx 1 - e^{-{n(n-1) \over 2H}}$$

This is also the value the linked Wikipedia article means, although the way it is written there is a bit ambiguous with respect to the $/$ and $\cdot$ priority rules.
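As a rough numerical check of the derivation above, the following sketch compares the exact product with the exponential approximation. The function names are made up for illustration, and the sample values of $n$ and $H$ are arbitrary choices rather than anything from the question.

```python
import math


def exact_collision_probability(n, H):
    """1 minus the exact product (H-1)/H * (H-2)/H * ... * (H-(n-1))/H."""
    p_all_distinct = 1.0
    for k in range(1, n):
        p_all_distinct *= (H - k) / H
    return 1.0 - p_all_distinct


def approx_collision_probability(n, H):
    """The approximation 1 - exp(-n(n-1)/(2H))."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * H))


# Arbitrary sample sizes: the classic birthday case and two larger spaces.
for n, H in [(23, 365), (100, 10**6), (2**16, 2**32)]:
    exact = exact_collision_probability(n, H)
    approx = approx_collision_probability(n, H)
    print(f"n={n}, H={H}: exact={exact:.6f}, approx={approx:.6f}")
```

The two values should agree more closely as $n$ gets smaller relative to $H$, since each factor $1 - k/H \approx e^{-k/H}$ is only accurate when $k/H$ is small.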

orlp

Here is a slightly different approach: The total number of ways to pick $n$ numbers among $H$ values allowing repetition (and with the order of picking counted in) is $A=H^n$. The number of ways to pick without repetitions is $B=\frac{H!}{(H-n)!}.$

Clearly, the probability you want to compute is $(A-B)/A=1-B/A$. Now, does $B/A$ contain the exponential you are looking for? Using Stirling's formula: $$x!\approx \sqrt{2\pi x}(x/e)^x.$$ We see that: $$ A/B\approx \sqrt{\frac{H-n}{H}}\left(\frac{H-n}{H}\right)^{H-n}e^n. $$

Taking the logarithm and using $\log(1-x)\approx -x-x^2/2$ for small $x$, we find: $$\log(A/B)\approx (H-n+1/2)\log(1-n/H)+n\approx -(H-n+1/2)(n/H+(n/H)^2/2)+n$$

Expand the product and drop the negligible terms (those of order $n^3/H^2$ and smaller) to obtain $\log(A/B)\approx n(n-1)/(2H)$.
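For readers who want the intermediate step, here is the expansion written out term by term. It assumes $n \ll H$ and only tracks which terms survive, not the exact size of the remainder.

$$-(H-n+\tfrac12)\left(\frac{n}{H}+\frac{n^2}{2H^2}\right)+n = -n-\frac{n^2}{2H}+\frac{n^2}{H}+\frac{n^3}{2H^2}-\frac{n}{2H}-\frac{n^2}{4H^2}+n$$

The $-n$ and $+n$ cancel, $-\frac{n^2}{2H}+\frac{n^2}{H}=\frac{n^2}{2H}$, and together with $-\frac{n}{2H}$ this leaves $\frac{n(n-1)}{2H}$. The remaining terms carry at least a factor $1/H^2$ and are dropped.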

Since $B/A=e^{-\log(A/B)}$, putting everything together indeed yields $$p(n; H)\approx 1-e^{-\frac{n(n-1)}{2H}}.$$

The main advantage of this approach is that it avoids taking the product of many approximations (as in nightcracker's answer) which, in general, requires great care.
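A quick way to sanity-check the claim $\log(A/B)\approx n(n-1)/(2H)$ is to evaluate $\log(A/B)$ directly through the log-gamma function, which avoids computing huge factorials. This is only a sketch: the helper names are hypothetical, and for very large $H$ the subtraction of two large `lgamma` values loses floating-point precision, so it is best tried with moderate sizes.

```python
import math


def log_A_over_B(n, H):
    """log(H^n) - log(H! / (H-n)!), using lgamma(x + 1) == log(x!)."""
    log_A = n * math.log(H)
    log_B = math.lgamma(H + 1) - math.lgamma(H - n + 1)
    return log_A - log_B


def leading_term(n, H):
    """The leading term n(n-1)/(2H) from the expansion."""
    return n * (n - 1) / (2 * H)


# Arbitrary sample sizes, kept moderate to limit cancellation error.
for n, H in [(23, 365), (100, 10**6)]:
    print(f"n={n}, H={H}: log(A/B)={log_A_over_B(n, H):.6f}, "
          f"n(n-1)/(2H)={leading_term(n, H):.6f}")
```

The gap between the two columns is the contribution of the dropped terms, which shrinks as $n/H$ gets smaller.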

minar

Here's yet another similar way to get this approximation.

Consider every pairing of $n$ elements chosen from $H$ values, ignoring elements paired with themselves but not requiring that the elements be distinct, i.e. let $H_n=\{n \text{ elements chosen from } H\}$ and $P_n=\{(h_i,h_j) \mid h_i,h_j \in H_n \text{ and } i\ne j\}$. Each element can be matched with any other element, so there are $\frac{n(n-1)}{2}$ pairs in $P_n$ once $(h_b, h_a)$ is rejected whenever $(h_a,h_b)$ is already present. The probability that any given pair is not a match is $(1-1/H)$.

Notice that these pairs being nonmatches are not independent events since if $(h_1,h_2)$ is not a match and $(h_2,h_3)$ is not a match, the probability of $(h_1,h_3)$ not being a match is $(1-\frac{1}{H-1})$.

This approximation treats these as independent events. In that case,
$$\Pr(\text{no matches in } P_n)=(1-1/H)^{n(n-1)/2}=(1-1/H)^{H\cdot n(n-1)/(2H)}\approx e^{-n(n-1)/(2H)},$$
since $\lim_{H \to \infty}(1-1/H)^H=e^{-1}$.
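To see how much the independence assumption costs in practice, one can compare the pairwise estimate with a small Monte Carlo simulation of the experiment. The sketch below is illustrative only: the function names, the trial count and the fixed seed are all arbitrary choices.

```python
import random


def pairwise_estimate(n, H):
    """Collision probability if all n(n-1)/2 pairs were independent."""
    pairs = n * (n - 1) // 2
    return 1.0 - (1.0 - 1.0 / H) ** pairs


def simulated_collision_rate(n, H, trials, seed=0):
    """Fraction of trials in which n uniform draws from H values repeat a value."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    collisions = 0
    for _ in range(trials):
        seen = set()
        for _ in range(n):
            value = rng.randrange(H)
            if value in seen:
                collisions += 1
                break
            seen.add(value)
    return collisions / trials


n, H = 23, 365
print("pairwise estimate:", pairwise_estimate(n, H))
print("simulated rate:   ", simulated_collision_rate(n, H, trials=20000))
```

The simulated rate fluctuates from seed to seed, so small differences between the two numbers are partly sampling noise and partly the effect of ignoring the dependence between pairs.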

user1992284