22

I have a set of numbers from which I am randomly and independently selecting elements. After a number of these random selections I want to know the coverage of the elements in the set, where coverage is the number of elements that have been selected at least once divided by the total number of elements in the set.

To restate this: what is the probability distribution of the different coverage values on a set after $X$ randomly, independently selected elements of the set?

qwr
  • 11,362

4 Answers

18

If there are $n$ elements of the set then the probability that $M=m$ have been selected after a sample of $x$ (with replacement) is

$$\frac{S_2(x,m) \; n!}{n^x \; (n-m)!} $$

where $S_2(x,m)$ is a Stirling number of the second kind.

The expected value of $M$ is: $n \left(1- \left(1-\dfrac{1}{n}\right)^x \right)$.

The variance is: $n\left(1-\dfrac{1}{n}\right)^x + n^2 \left(1-\dfrac{1}{n}\right)\left(1-\dfrac{2}{n}\right)^x - n^2\left(1-\dfrac{1}{n}\right)^{2x}. $
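As a sanity check, the three expressions above can be evaluated exactly for small cases. The following is a minimal sketch, not part of the original answer: the function names `stirling2` and `coverage_pmf` are made up for illustration, and the Stirling numbers are computed with the standard recurrence $S_2(i,j) = j\,S_2(i-1,j) + S_2(i-1,j-1)$.

```python
from fractions import Fraction
from math import factorial


def stirling2(x, m):
    """Stirling number of the second kind S_2(x, m), via the standard recurrence."""
    # table[i][j] holds S_2(i, j); entries with j > i stay 0.
    table = [[0] * (m + 1) for _ in range(x + 1)]
    table[0][0] = 1
    for i in range(1, x + 1):
        for j in range(1, min(i, m) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[x][m]


def coverage_pmf(n, x):
    """Exact P(M = m) for m = 0..n after x draws with replacement from n elements."""
    return [
        Fraction(stirling2(x, m) * factorial(n), n**x * factorial(n - m))
        for m in range(n + 1)
    ]


# Illustrative values only: a set of 6 elements sampled 10 times.
n, x = 6, 10
pmf = coverage_pmf(n, x)
mean = sum(m * p for m, p in enumerate(pmf))
var = sum(m * m * p for m, p in enumerate(pmf)) - mean**2

q1 = Fraction(n - 1, n) ** x  # (1 - 1/n)^x
q2 = Fraction(n - 2, n) ** x  # (1 - 2/n)^x

# The probabilities sum to 1 and reproduce the closed-form mean and variance.
assert sum(pmf) == 1
assert mean == n * (1 - q1)
assert var == n * q1 + n * (n - 1) * q2 - n**2 * q1**2
```

Exact rational arithmetic (`Fraction`) is used so the comparisons are not affected by rounding; for large $n$ and $x$ the integers grow quickly and a floating-point or recurrence-based approach is likely more practical.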

Henry
  • 169,616
  • 1
    I think you need to swap $m$ and $x$ in your expression for the probability. Does that affect the expected value and variance calculations, too? – Mike Spivey Apr 13 '11 at 18:31
  • @Mike Spivey: you are right about the probability. $m$ should not appear in the expressions for mean and variance – Henry Apr 13 '11 at 19:32
  • @Henry This is a nice solution! –  Apr 13 '11 at 19:47
  • I cannot validate the solution, but seeing as how Mike is a stone's throw away from my hometown of Enumclaw and he votes yes on the answer, I'm going to have to accept the answer. – Ross Rogers Apr 13 '11 at 22:16
  • 6
    @Ross: The argument Henry is using (I think) is as follows: The number of ways to choose which $m$ elements are to be covered is $\binom{n}{m}$. Then the number of ways for the $x$ elements in the sample to cover exactly those $m$ elements is the same as the number of ways to distribute $x$ elements into $m$ distinguishable nonempty subsets; i.e., $m! S(x,m)$, where $S(x,m)$ is a Stirling number of the second kind. Then the probability is obtained by dividing by the number of ways to choose $x$ elements with replacement from $n$ elements, which is $n^x$. The factor of $m!$ cancels. – Mike Spivey Apr 13 '11 at 22:36
  • 1
    @Ross: There are a handful of us from the Pacific NW on this site. :) – Mike Spivey Apr 13 '11 at 22:36
  • I am probably severely misunderstanding something here, but if I set x = m = n then don’t I compute a probability ~1? Does that make any sense? Why would the probability of choosing 32 unique items from a set of 32 unique items when 32 items are chosen with replacement, be approximately equal to 1? I’m using the approximation S2(x,m) ≈ m^x/m!. – Shelby Moore III Jul 19 '18 at 07:41
  • 1
    @ShelbyMooreIII If $x=m=n$ then you get $\frac{S_2(n,n) \; n!}{n^n \; (n-n)!}=\frac{n!}{n^n}$ as the probability that you select the $n$ different values in the first $n$ attempts. A direct calculation would give $\frac{n}{n} \times \frac{n-1}{n} \times \cdots \times \frac{2}{n} \times \frac{1}{n}$. This is not $1$ for $n\gt 1$: e.g. for $n=2$ it is $0.5$ and for $n=6$ it is about $0.0154321$ – Henry Jul 19 '18 at 07:50
  • Ah I see my error was not checking that the asymptotic approximation is as x is much greater than m. I was coming from an answer based on your answer and failed to check the definition of the approximation at Wikipedia. Apology for wasting your time. – Shelby Moore III Jul 19 '18 at 07:57
  • 1
    I would like to use the mean and standard deviation to estimate the covered region of a genome using shallow shotgun sequencing. Is there an underlying source which states the closed form and/or derives the mean and standard deviation or did you derive these directly? – dhakim Jul 05 '22 at 23:57
  • @dhakim I had previously worked the three expressions out myself many years before this, but was not original in doing this. The probability comes from the definition of Stirling numbers of the second kind, though a recurrence like the one you suggest will also find the value. The mean is even easier: for example qwr sets out the obvious method in another answer here and says how to find the variance – Henry Jul 06 '22 at 08:02
7

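The expected proportion of elements covered, $E\left(\frac{m}{n}\right)$, has a simple limiting form as $n \rightarrow \infty$ with the sampling rate $ r / n $ fixed, where $r$ is the number of samples drawn. Note that $\lim_{n \rightarrow \infty} \left(1-\frac{1}{n}\right)^n = e^{-1}$, so that $\left(1-\frac{1}{n}\right)^r = \left(\left(1-\frac{1}{n}\right)^n\right)^{r/n} \rightarrow e^{-\frac{r}{n}}$ and therefore: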

$$\lim_{n \rightarrow \infty} E\left(\frac{m}{n}\right) = 1 - e^{-\frac{r}{n}}$$

so that for example sampling $r=n$ times is expected to cover about 63% of the set. This is already a reasonable approximation once $n$ is around 100 or more.
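To get a feel for how close the limiting form is, here is a small sketch that compares it with the exact expected coverage $1 - \left(1-\frac{1}{n}\right)^r$ (the expected value from the other answers, divided by $n$). The function names and the chosen values of $n$ are illustrative assumptions.

```python
import math


def expected_coverage(n, r):
    """Exact expected fraction covered after r draws with replacement from n elements."""
    return 1 - (1 - 1 / n) ** r


def limiting_coverage(n, r):
    """Large-n approximation 1 - exp(-r/n)."""
    return 1 - math.exp(-r / n)


for n in (10, 100, 1000, 10000):
    r = n  # sample as many times as there are elements
    exact = expected_coverage(n, r)
    approx = limiting_coverage(n, r)
    print(f"n={n:>6}  exact={exact:.5f}  approx={approx:.5f}  diff={exact - approx:.5f}")
```

The gap between the two columns shrinks as $n$ grows, which is what the limit argument predicts.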

  • Hi, if you have the time, how do you modify this to get coverage when you repeat the process? for example sampling 3000 out of 60000 for 100 times in a row – Austin Aug 05 '20 at 23:01
3

Derivation of $\operatorname E [M]$ with a classic use of indicator variables and linearity of expectation:

We sample from set $X = \{1, \dots, n \}$. Let $X_i$ be $1$ if member $i$ is in our sample, $0$ otherwise.

For a sample of size $x$, the probability that none of the values are $i$ is $\left(\frac{n-1}{n}\right)^x$. Thus the probability that the sample includes $i$, and therefore $X_i = 1$, is $1-\left(\frac{n-1}{n}\right)^x$.

We have $$\operatorname E[M] = \operatorname E[X_1 + \dots + X_n] = \operatorname E[X_1] + \dots + \operatorname E[X_n] = n \left(1-\left(\frac{n-1}{n}\right)^x\right)$$

Variance is calculated similarly, but we have to consider separately $\operatorname E[X_i^2] $ and $\operatorname E[X_i X_j]$ for $ i \ne j$.
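One way this step might be carried out, using the shorthand $a = \left(\frac{n-1}{n}\right)^x$ and $b = \left(\frac{n-2}{n}\right)^x$ (notation introduced only for this sketch): since $X_i$ is an indicator, $X_i^2 = X_i$ and so $\operatorname E[X_i^2] = 1 - a$. For $i \ne j$, the probability that neither $i$ nor $j$ appears in the sample is $b$, so by inclusion-exclusion

$$\operatorname E[X_i X_j] = 1 - 2a + b.$$

Summing over the $n$ squared terms and the $n(n-1)$ ordered pairs with $i \ne j$:

$$\operatorname{Var}(M) = n(1-a) + n(n-1)(1 - 2a + b) - n^2(1-a)^2 = n a + n(n-1)\, b - n^2 a^2.$$

Since $n(n-1) = n^2\left(1-\frac{1}{n}\right)$, this agrees with the variance expression given in the other answer.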

qwr
  • 11,362
0

You can build a recurrence relation to construct the probability distribution using dynamic programming.

We define the recurrence p(m, x) as the probability of selecting m unique elements after x picks from a set of size n.

After 0 picks, we must have selected exactly 0 unique elements.

p(m=0, x=0) = 1
p(m!=0, x=0) = 0

Say the xth pick leaves m unique elements selected. Then either this last pick repeats an element that was already chosen (with probability m/n, from the state (m, x-1)), or this pick adds a new mth unique value (with probability 1 - (m-1)/n, from the state (m-1, x-1)).

So we set the recurrence to:

p(m,x) = p(m-1, x-1) * (1 - (m-1)/n) + p(m, x-1) * (m/n)
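The recurrence above can be turned into code along these lines. This is a minimal sketch under the definitions given here; the function name `coverage_distribution` is hypothetical, and exact fractions are used only to make the checks easy to read.

```python
from fractions import Fraction


def coverage_distribution(n, x):
    """Return [p(0, x), ..., p(n, x)] by applying the recurrence one pick at a time."""
    # Base case: after 0 picks exactly 0 unique elements have been selected.
    p = [Fraction(0)] * (n + 1)
    p[0] = Fraction(1)
    for _ in range(x):
        nxt = [Fraction(0)] * (n + 1)
        for m in range(n + 1):
            # The pick repeats one of the m elements already seen.
            nxt[m] += p[m] * Fraction(m, n)
            # The pick adds a new m-th element, coming from m - 1 seen.
            if m >= 1:
                nxt[m] += p[m - 1] * (1 - Fraction(m - 1, n))
        p = nxt
    return p


# Illustrative check: 6 picks from a set of 6 elements.
dist = coverage_distribution(6, 6)
assert sum(dist) == 1
# All 6 elements covered in 6 picks has probability 6!/6^6.
assert dist[6] == Fraction(720, 6**6)
```

Only the previous row of the table is kept, so memory use is proportional to n while the running time is proportional to n times x. Swapping `Fraction` for floats would likely be preferable for large inputs.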

dhakim
  • 69