Say I sample with replacement from a set of $N$ unique elements, where each draw selects an element uniformly at random. If I draw $M$ times from this set, what is the exact probability $P(x)$ that I have observed at least $x$ unique elements?

I believe different variants of this question have been asked on this site; however, I haven't seen one that asks for an explicit probability $P(x)$.

For example, Ross Rogers asks a variant of this question here: probability distribution of coverage of a set after `X` independently, randomly selected members of the set. There, Henry calculates the mean and variance of the number of unique elements, $x$, covered in a set of $N$ elements after sampling with replacement $M$ times (I switch $M$ and $x$ here to fit my variable specification).

Reproducing Henry's results here:

Mean[x] = $N\left(1 - \left(1 - \dfrac{1}{N}\right)^M\right)$

Var[x] = $N\left(1-\dfrac{1}{N}\right)^M + N^2 \left(1-\dfrac{1}{N}\right)\left(1-\dfrac{2}{N}\right)^M - N^2\left(1-\dfrac{1}{N}\right)^{2M}$

(I'll note that I don't quite understand the derivation for Var[x]...)

How can we translate this variance into our $P(x)$?

Wilk
  • 81
  • When you say "exact probability", are you after a numerical answer? That would vary, depending on the values of M and N surely? – Widor Nov 02 '12 at 15:33
  • @Widor I mean a formula for P(x) that would let me compute a numerical answer for arbitrary (positive integer) N, M, and x. – Wilk Nov 02 '12 at 20:00
  • 2
    I don't understand why you cite only the mean and variance from that post -- it explicitly gives the probability of sampling exactly $x$ elements in terms of the Stirling numbers of the second kind; all you need to do is sum it from $x$ to $N$ to get the probability of sampling at least $x$ elements. Is your question about whether that sum can be simplified? If not, I don't understand what it's about. – joriki Nov 05 '12 at 15:34

2 Answers

5

According to Henry's answer, after $m$ draws with replacement from a set of $n$ elements (the question's $M$ and $N$), the probability that exactly $k$ unique items have been selected (for $k\leq \min(m,n)$) is:

$$P(k) = \frac{S_2(m,k) \; n!}{n^m \; (n-k)!} $$

where $S_2(m,k)$ is a Stirling number of the second kind.
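To make this concrete, here is a minimal Python sketch of the exact formula, including the tail sum for "at least $x$". The function names (`stirling2`, `p_exactly`, `p_at_least`) are my own, and I am assuming the standard recurrence $S_2(m,k) = k\,S_2(m-1,k) + S_2(m-1,k-1)$ for the Stirling numbers; exact rational arithmetic avoids rounding issues for moderate $n$ and $m$.

```python
from fractions import Fraction
from math import factorial


def stirling2(m, k):
    """Stirling number of the second kind, S2(m, k), via the recurrence
    S2(m, k) = k * S2(m-1, k) + S2(m-1, k-1)."""
    row = [1] + [0] * k  # row for m = 0: S2(0, 0) = 1, S2(0, j) = 0 for j > 0
    for _ in range(m):
        # Update from the right so row[j - 1] still holds the previous row.
        for j in range(k, 0, -1):
            row[j] = j * row[j] + row[j - 1]
        row[0] = 0  # S2(m, 0) = 0 for m >= 1
    return row[k]


def p_exactly(n, m, k):
    """P(exactly k distinct elements seen in m uniform draws from n elements)."""
    if k < 0 or k > min(n, m):
        return Fraction(0)
    falling = factorial(n) // factorial(n - k)  # n! / (n - k)!
    return Fraction(stirling2(m, k) * falling, n ** m)


def p_at_least(n, m, x):
    """P(at least x distinct elements): sum P(k) for k = x .. min(n, m)."""
    return sum(
        (p_exactly(n, m, k) for k in range(max(x, 0), min(n, m) + 1)),
        Fraction(0),
    )


if __name__ == "__main__":
    # Hypothetical example values: n = 40 elements, m = 15 draws, x = 10.
    print(float(p_at_least(40, 15, 10)))
```

Summing `p_exactly(n, m, k)` over all $k$ gives exactly 1, which is a quick sanity check on the formula.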

If you are interested in an asymptotic approximation, you can use the following approximation to the Stirling number, which holds when $m$ is large compared to $k$:

$$S_2(m,k)\approx {k^m \over k!}$$

So:

$$P(k)\approx {k^m n! \over n^m k! (n-k)!} = {k^m \over n^m}{n\choose k}$$

and the probability that at most $x$ unique items have been selected (the complement of observing at least $x+1$) is:

$$ \sum_{k=1}^{x} P(k) \approx {1\over n^m}\sum_{k=1}^{x}{{n\choose k}k^m}$$

I am not sure how to simplify that sum; I opened a separate question for this.

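One way to approximate $P(k)$ further is to apply Stirling's approximation to the three factorials (the exponential factors $e^{-n}$, $e^{-k}$ and $e^{-(n-k)}$ cancel):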

$$\begin{aligned}P(k) &\approx {k^m \sqrt{2\pi n}\cdot n^n \over n^m \sqrt{2\pi k}\cdot k^k \sqrt{2\pi (n-k)}\cdot (n-k)^{n-k}} \\ &= {k^{m-k-1/2}\cdot n^{n-m+1/2} \over \sqrt{2\pi} \cdot (n-k)^{n-k+1/2}}\end{aligned}$$
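
A note on when $S_2(m,k)\approx k^m/k!$ is reasonable. This is my own addition, based on the standard inclusion-exclusion formula for the Stirling numbers of the second kind, so treat it as a sketch rather than part of Henry's derivation:

$$S_2(m,k) = \frac{1}{k!}\sum_{j=0}^{k} (-1)^j \binom{k}{j} (k-j)^m = \frac{k^m}{k!}\left[1 - k\left(1-\frac{1}{k}\right)^m + \binom{k}{2}\left(1-\frac{2}{k}\right)^m - \cdots\right]$$

The approximation keeps only the leading $j=0$ term. The first correction is small only when $k\left(1-\frac{1}{k}\right)^m \ll 1$, which is roughly $m \gg k\ln k$. When $m$ is comparable to $k$, the dropped terms are not negligible and the approximate $P(k)$ can be far off.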

3

I would recommend using the exact formula instead of the approximation.

For example, I was looking at the specific case of seeing exactly 10 distinct elements in 15 random selections from a set of 40 elements. Using the above approximation, I got a probability of 0.789. Using the exact formula (the Stirling numbers can be found here: https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind) I got the true answer of 0.0363.

The approximation gives a wildly different answer from the exact formula, and the exact formula is not overly complicated.
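
For anyone who wants to reproduce the comparison, here is a short Python sketch. The helper name `stirling2_explicit` is my own, and I am assuming the explicit inclusion-exclusion formula for $S_2(m,k)$ given on the Wikipedia page linked above; `math.perm` needs Python 3.8 or later.

```python
from math import comb, factorial, perm


def stirling2_explicit(m, k):
    """S2(m, k) = (1/k!) * sum_{j=0..k} (-1)^j * C(k, j) * (k - j)^m."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** m for j in range(k + 1))
    return total // factorial(k)  # the sum is always divisible by k!


n, m, k = 40, 15, 10  # 40 elements, 15 draws, exactly 10 distinct

# Exact: S2(m, k) * n! / ((n - k)! * n^m); perm(n, k) is n! / (n - k)!.
exact = stirling2_explicit(m, k) * perm(n, k) / n ** m

# Approximation: replace S2(m, k) by k^m / k!, giving (k/n)^m * C(n, k).
approx = k ** m * comb(n, k) / n ** m

print(f"exact  = {exact:.4f}")   # about 0.0363
print(f"approx = {approx:.3f}")  # about 0.789
```

Here $m = 15$ is not much larger than $k = 10$, which is the regime where replacing $S_2(m,k)$ by $k^m/k!$ overestimates badly.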