Earlier I posted the question "Probability of Seeing "X" % of Balls in "Y" Turns?", where I asked:

  • Suppose we have the integers 1, 2, 3, ..., 99, 100
  • Each integer has an equal probability of being selected
  • In round=1, we pick 5 numbers randomly without replacement and then put them back
  • In round=2 we again pick 5 numbers randomly without replacement and then put them back
  • We do this until round = 100

Suppose we are currently at round = n and we have seen "m" unique numbers. If we know that there are 100 total numbers - what is the probability we will have seen 99% of all numbers by round = k? (k>n)

In the comments, someone suggested that:

You can answer question 1 with a recurrence. For example, if you have seen 49 unique numbers in the first 15 rounds, then I think the probability of having seen exactly 99 unique numbers (99% of them) in the first 100 rounds is about 34.701% , of having seen exactly 100 unique numbers (i.e. all of them) is about 52.081% and of having seen fewer than 99 is about 13.218% . That compares to 33.625% , 54.708% and 11.666% without the information on the position after 15 rounds.

I am trying to write this recurrence formally. Here is my attempt:

Define variables:

  • $n$ = current round
  • $m$ = number of unique numbers seen by round n
  • $k$ = target round
  • $i$ = number of unique numbers seen by round k

We want to find the probability of seeing at least 99% of all numbers (i.e., 99 or 100 unique numbers) by round k, given that we've seen m unique numbers by round n.

Here is my attempt to define a recurrence relation for the probability:

$$P(i,r) = \text{probability of seeing exactly i unique numbers by round r}$$

In the first round, we are guaranteed to see exactly 5 unique numbers, so $P(5,1) = 1$.

In the second round, we can observe the following:

$$P(5,2) = 1 \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{5}{100} = \frac{5^5}{100^5}$$

$$P(6,2) = 5 \cdot 1 \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{95}{100} = 5 \cdot \frac{5^4 \cdot 95}{100^5}$$

$$P(7,2) = 10 \cdot 1 \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{95}{100} \cdot \frac{94}{100} = 10 \cdot \frac{5^3 \cdot 95 \cdot 94}{100^5}$$

$$P(8,2) = 10 \cdot 1 \cdot \frac{5}{100} \cdot \frac{5}{100} \cdot \frac{95}{100} \cdot \frac{94}{100} \cdot \frac{93}{100} = 10 \cdot \frac{5^2 \cdot 95 \cdot 94 \cdot 93}{100^5}$$

$$P(9,2) = 5 \cdot 1 \cdot \frac{5}{100} \cdot \frac{95}{100} \cdot \frac{94}{100} \cdot \frac{93}{100} \cdot \frac{92}{100} = 5 \cdot \frac{5 \cdot 95 \cdot 94 \cdot 93 \cdot 92}{100^5}$$

$$P(10,2) = 1 \cdot \frac{95}{100} \cdot \frac{94}{100} \cdot \frac{93}{100} \cdot \frac{92}{100} \cdot \frac{91}{100} = \frac{95 \cdot 94 \cdot 93 \cdot 92 \cdot 91}{100^5}$$

The third round gets a bit more complicated.

P(6,3): We can only get this if we had 6 unique numbers after round 2 and then drew only those 6 numbers in round 3.

$$P(6,3) = P(6,2) \cdot \frac{6^5}{100^5}$$

P(7,3): We can get this in two ways:

  • We had 6 unique numbers after round 2 and drew 1 new number in round 3
  • We had 7 unique numbers after round 2 and drew no new numbers in round 3

$$P(7,3) = P(6,2) \cdot \binom{6}{4} \cdot \frac{6^4}{100^5} \cdot \frac{94}{100} + P(7,2) \cdot \frac{7^5}{100^5}$$

P(8,3): We can get this in three ways:

  • We had 6 unique numbers after round 2 and drew 2 new numbers in round 3
  • We had 7 unique numbers after round 2 and drew 1 new number in round 3
  • We had 8 unique numbers after round 2 and drew no new numbers in round 3

$$P(8,3) = P(6,2) \cdot \binom{6}{3} \cdot \frac{6^3}{100^5} \cdot \binom{94}{2} \cdot \frac{2}{100^2} + P(7,2) \cdot \binom{7}{4} \cdot \frac{7^4}{100^5} \cdot \frac{93}{100} + P(8,2) \cdot \frac{8^5}{100^5}$$

P(9,3): We can get this in four ways:

  • We had 6 unique numbers after round 2 and drew 3 new numbers in round 3
  • We had 7 unique numbers after round 2 and drew 2 new numbers in round 3
  • We had 8 unique numbers after round 2 and drew 1 new number in round 3
  • We had 9 unique numbers after round 2 and drew no new numbers in round 3

$$P(9,3) = P(6,2) \cdot \binom{6}{2} \cdot \frac{6^2}{100^5} \cdot \binom{94}{3} \cdot \frac{3}{100^3} + P(7,2) \cdot \binom{7}{3} \cdot \frac{7^3}{100^5} \cdot \binom{93}{2} \cdot \frac{2}{100^2} + P(8,2) \cdot \binom{8}{4} \cdot \frac{8^4}{100^5} \cdot \frac{92}{100} + P(9,2) \cdot \frac{9^5}{100^5}$$

etc etc

This is where I got stuck. I don't know how to recognize the pattern here, I don't know what the initial conditions are, and I don't know how to write the final formula for 99% of all numbers.

Can someone please show me how to do this?

PS: I tried to use some logic and approach this problem from another direction and show in general that:

$$P(\text{at least 99\%}, k | m,n) = P(99, k | m,n) + P(100, k | m,n)$$

Using the definition of conditional probability and the properties of events:

$$\begin{align} P(\text{at least 99\%}, k | m,n) &= P(\text{99 or 100 unique numbers by round k} | \text{m unique numbers by round n}) \\[10pt] &= P(\{99 \text{ unique numbers by k}\} \cup \{100 \text{ unique numbers by k}\} | \text{m unique numbers by n}) \\[10pt] &= P(A \cup B | C) \end{align}$$

Where:

  • A = event of seeing exactly 99 unique numbers by round k
  • B = event of seeing exactly 100 unique numbers by round k
  • C = event of seeing m unique numbers by round n

A and B are mutually exclusive (it is impossible to have seen exactly 99 and exactly 100 unique numbers at the same time), and for mutually exclusive events the probability of the union is the sum of the individual probabilities:

$$\begin{align} P(A \cup B | C) &= P(A | C) + P(B | C) \\[10pt] &= P(99 \text{ unique numbers by k} | \text{m unique numbers by n}) + \\ &\quad P(100 \text{ unique numbers by k} | \text{m unique numbers by n}) \\[10pt] &= P(99, k | m,n) + P(100, k | m,n) \end{align}$$

This proves that:

$$P(\text{at least 99\%}, k | m,n) = P(99, k | m,n) + P(100, k | m,n)$$

Maybe I should approach the problem this way?

konofoso

2 Answers

If $i-j$ numbers have been seen so far, the chance that the next draw of $5$ contains exactly $j$ new numbers is hypergeometric, so I think the recurrence is

$$\begin{array}{llll} P(i,r) & = & & \dfrac{{i \choose 5}{100-i \choose 0}}{100 \choose 5}P(i,r-1)\\ & & + & \dfrac{{i-1 \choose 4}{101-i \choose 1}}{100 \choose 5}P(i-1,r-1)\\ & & + & \dfrac{{i-2 \choose 3}{102-i \choose 2}}{100 \choose 5}P(i-2,r-1)\\ & & + & \dfrac{{i-3 \choose 2}{103-i \choose 3}}{100 \choose 5}P(i-3,r-1)\\ & & + & \dfrac{{i-4 \choose 1}{104-i \choose 4}}{100 \choose 5}P(i-4,r-1)\\ & & + & \dfrac{{i-5 \choose 0}{105-i \choose 5}}{100 \choose 5}P(i-5,r-1)\\ & = & & \sum\limits_{j=0}^5 \dfrac{{i-j \choose 5-j}{100-(i-j) \choose j}}{100 \choose 5}P(i-j,r-1)\\ \end{array}$$

which is a little long, but easy enough to set up in a spreadsheet.
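As a quick sanity check on a single step (a sketch, assuming we start from $P(5,1)=1$ and weight each transition by the hypergeometric probability of drawing $j$ unseen numbers), the second round gives

$$P(5+j,\,2) = \dfrac{{5 \choose 5-j}{95 \choose j}}{100 \choose 5}, \qquad j = 0,1,\dots,5,$$

so for example $P(10,2) = {95 \choose 5}/{100 \choose 5} = 57940519/75287520 \approx 0.7696$ and $P(6,2) = 5\cdot 95/75287520 \approx 6.3\times 10^{-6}$. The six values sum to $1$ by Vandermonde's identity.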

You would usually start with $P(0,0)=1$ and $P(i,0)=0$ for $i\neq 0$, but you may find it easier to start with $P(5,1)=1$ and $P(i,1)=0$ for $i\neq 5$, since you must see $5$ unique items in the first draw; for the same reason $P(i,r)=0$ for $i < 5$ and $r>0$.

But if you do know for some particular $m$ and $n$ that $P(m,n)=1$ and $P(i,n)=0$ for $i\neq m$, then you can start from there instead.
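If you would rather use code than a spreadsheet, here is a minimal Python sketch of the same idea. It assumes each transition is weighted by the hypergeometric probability ${s \choose 5-j}{100-s \choose j}/{100 \choose 5}$ of drawing $j$ new numbers when $s$ have been seen; the names `N`, `DRAW`, `step` and `distribution` are my own and purely illustrative.

```python
from math import comb

N, DRAW = 100, 5  # population size and numbers drawn per round (from the question)


def step(dist):
    """Advance one round; dist[s] = probability that s unique numbers have been seen."""
    total = comb(N, DRAW)
    new = [0.0] * (N + 1)
    for s, p in enumerate(dist):
        if p == 0.0:
            continue
        for j in range(DRAW + 1):  # j = how many of the 5 drawn numbers are new
            if s + j > N:
                break
            # Hypergeometric weight: DRAW - j already-seen numbers, j unseen ones.
            # math.comb returns 0 when the lower index exceeds the upper one.
            w = comb(s, DRAW - j) * comb(N - s, j) / total
            new[s + j] += p * w
    return new


def distribution(rounds, start_seen=0):
    """Distribution of unique numbers seen after `rounds` further rounds,
    starting from a state where exactly `start_seen` numbers are known to be seen."""
    dist = [0.0] * (N + 1)
    dist[start_seen] = 1.0
    for _ in range(rounds):
        dist = step(dist)
    return dist


if __name__ == "__main__":
    d = distribution(100)  # no prior information, 100 rounds
    print(f"P(100) = {d[100]:.5f}, P(99) = {d[99]:.5f}, "
          f"P(<99) = {1 - d[99] - d[100]:.5f}")
```

To condition on having seen $m$ numbers after $n$ rounds, call `distribution(k - n, start_seen=m)` and add the entries at indices 99 and 100 to get the probability of at least 99%.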

Henry

Alternate approximation approach:

The probability a specific number is missed on a given round is $0.95$ and rounds are independent, so a given number is missed after $n$ rounds with probability $.95^n$. Expectation is linear, so the expected number of missed numbers is $100 \times .95^n$. If that’s much larger than $1$, you probably haven’t seen almost everything and if it’s much less, you probably have.

In particular, you can get explicit probability bounds from the expectation. Let $p$ be the probability of at least 2 misses and $E$ the expected number of misses. If we replace every outcome of 0 or 1 misses with 1 and every outcome of at least 2 misses with 100, the expectation can only go up, so $E\leq 1\times (1-p) +100p$ and hence $p\geq (E-1)/99$. In the other direction, replacing 0 or 1 misses with 0 and anything larger with $2$ can only lower the expectation, so $E\geq 2p$ and hence $p\leq E/2$.

Putting this together, $1-p$, the probability of having seen at least 99% of the numbers, satisfies:

$$1-E/2\leq 1-p \leq 1-(E-1)/99$$

For small $n$, the upper bound is useful while the lower bound is useful for large $n$. In particular, for $n=100$, $E\approx .592$, so $.7\leq 1-p$ is a decent lower bound.
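To see how the two bounds behave as $n$ changes, here is a small Python sketch. The function name `miss_bounds` and its parameters are hypothetical, and I clip the bounds to $[0,1]$ since the raw expressions can fall outside that range.

```python
def miss_bounds(n, total=100, draw=5):
    """Return (E, lower, upper): the expected number of unseen numbers after
    n rounds and the resulting bounds on P(at most one number unseen)."""
    e = total * (1 - draw / total) ** n          # E = 100 * 0.95**n
    lower = max(0.0, 1 - e / 2)                  # from p <= E/2
    upper = min(1.0, 1 - (e - 1) / (total - 1))  # from p >= (E-1)/99
    return e, lower, upper


if __name__ == "__main__":
    for n in (25, 50, 75, 100):
        e, lo, hi = miss_bounds(n)
        print(f"n={n:3d}  E={e:8.4f}  {lo:.4f} <= 1-p <= {hi:.4f}")
```

For $n=100$ this gives a lower bound of about $0.704$, while the upper bound is clipped to $1$ and so carries no information there.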

Eric