4

Given $m$ people and an $n$ possible "days of the year", what is the expected number of days which 2 or more people share as a birthday (if the distribution of birthdays is iid uniform over the possible days) ? (Alternative formulation: We use $n$ distinct colors to paint $m$ objects. Each object gets colored with a uniformly iid randomly chosen color. What is the number of colors that color at least two objects?)

Written more precisely, let $B_i$ be independent uniform random on $S = \{ 1, ..., n \}$ for $i = 1, ..., m$. What is the expected cardinality of the set of $d \in S$ where there exist $i,j$ not equal with $B_i = B_j = d$?

This sounds similar to birthday problem - expected number of collisions which asks for the number of collisions, but that question was about the number of people who share birthdays, not the number of days shared. The answer to that question is in the range $[0,m]$ while the answer to my question is in the range $[0,n]$.

Ultimately this is actually a question about hash collisions and prefix collisions in a prefix tree, but I deemed the birthday formulation the most familiar and most likely to be helpful to others searching for it in the future.

2 Answers2

5

I gave the answer in a previous question without calculation. Here is the justification

  • The probability a day has no people is $\left(1-\frac1n\right)^m$ so the expected number of days with no people is $n\left(1-\frac1n\right)^m$
  • The probability a day has one person is ${m \choose 1}\frac1n\left(1-\frac1n\right)^{m-1}$ so the expected number of days with one person is $m\left(1-\frac1n\right)^{m-1}$
  • So the expected number of days with two or more people is $$n - n \left(1-\frac1n\right)^m - m \left(1-\frac1n\right)^{m-1}$$
Henry
  • 169,616
  • Thanks! The expected number of days with exactly one person (the second bullet point) is actually interesting for my purposes, and has a critical point (for $m$) between $n-1$ and $n$, which is somewhat surprising (IOW you go from expecting more unique birthdays to fewer exactly when number of people crosses the number of days). – R.. GitHub STOP HELPING ICE Oct 04 '21 at 13:47
  • And thanks again. While the expression was unwieldy to do some of the things I wanted, it made it easy to look at numerical behavior without resorting to running simulations, and to make predictions that accurately predicted that what I wanted to try would work. (In particular, that prefix trees on 8-bit alphabet are a bad idea but that dropping to a 4-bit alphabet makes a huge difference.) – R.. GitHub STOP HELPING ICE Oct 04 '21 at 22:31
2

For each $d\in[n]\overset{\text{Def.}}=\{1,2,\dots, n\}$, consider the random variable $$X_d\overset{\text{Def.}}=\lvert\{i\in[m]: B_i = n\}\rvert.$$ $X_d$ is the number of people that have their birthday on day $d$.

Then you are looking for the expected value of the random variable $$C=\lvert\{d\in[n]: X_d\ge 2\}\rvert,$$ i.e. the expected value of the number of days on which two or more people have their birthday. I have named the random variable "$C$" for "collisions".

Now note that, using Iverson brackets, $$C=\lvert\{d\in[n]: X_d\ge 2\}\rvert = [X_1\ge 2]+[X_2\ge 2]+\dots+[X_n\ge 2].$$

In particular, $$\mathsf E(C) = \mathsf E([X_1\ge 2])+\dots+\mathsf E([X_n\ge 2]),$$ which, since the $X_d$ have the same distribution, namely $\text{Binomial}(\text{amount}=m,p=1/n)$ (exercise), is equal to $$n\mathsf P(X_1\ge 2) = n (1-\mathsf P(X_1=0)-\mathsf P(X_1=1))= n\left(1-(1-1/n)^m - \frac mn (1-1/n)^{m-1}\right).$$

Note. The $X_d$ are not independent, but looking further into them may reveal something about the distribution of $C$.