34

The other night I was hanging out with some friends and someone put a playlist on shuffle, where each song is drawn uniformly at random from a fixed playlist. The person who put the playlist together forgot how many songs were in it, so the topic came up of how to estimate the size of the playlist purely based on what we were hearing.

We came up with a few high-level ideas for how to do this. For example, using ideas from the Birthday Paradox, we could listen until we heard a song repeated for the first time, then use that to make an educated guess about how many songs were on the playlist in total. Alternatively, we could listen for a long time and build up a frequency histogram of the number of times each song was played, then use the fact that it should look roughly normally distributed to estimate the mean and variance, and from there the total number of songs on the list.

None of us are statisticians or have a lot of training in machine learning, but I suspect that this is probably a well-studied problem and that there are some really nice techniques we can use to estimate the playlist size.

Is there a good family of techniques for estimating the playlist size? From a practical perspective, would any of these techniques be relatively easy to work out without a computer or calculator?

Thanks!

Henry
  • 169,616
  • 4
    This is interesting and I think it's essentially a "capture recapture" problem, which comes up in ecology when one wants to estimate population sizes. This might give you a place to start looking if someone here doesn't have an answer. – dsaxton Jul 05 '15 at 18:05
  • 4
    In fact, it is exactly a capture recapture problem, which you can look up at https://en.wikipedia.org/wiki/Mark_and_recapture . I was going to post an answer based on this, but I think the article is sufficient to give you an understanding of the basic methods and the asymptotics of how well they work in terms of the number of samples and total population size. – user2566092 Jul 05 '15 at 18:27
  • 1
    @user2566092, this is not a standard capture recapture problem, for two reasons. First, each 'visit' captures only a single element from the population. Second, there are many visits. – vadim123 Jul 05 '15 at 18:48
  • This source calls this precise question Siobhan's problem, but unfortunately is behind a paywall. – vadim123 Jul 05 '15 at 18:58
  • 1
    @vadim123 I should have access to that paper while at work, so perhaps I'll give it a read and see if I can summarize the main ideas here. – templatetypedef Jul 05 '15 at 19:11
  • 1
    This web page gives a method for determining the most probable size of the playlist, given the total number of songs heard and the number of those that are distinct. – Brian M. Scott Jul 05 '15 at 19:13
  • 15 people think this question is well researched. Guys, c'mon – Alec Teal Jul 05 '15 at 21:16
  • 1
    You seem to be assuming that each choice is uniformly random and independent of its predecessors. Some shuffling algorithms don't work that way, because they want to enforce upper and lower bounds on the duration between repetitions. – kasperd Jul 05 '15 at 21:55
  • 3
    @kasperd Oh definitely. I figured that I'd make that simplifying assumption to make the problem mathematically tractable. :-) – templatetypedef Jul 05 '15 at 21:57
  • Related? https://plus.maths.org/content/kissing-frog-mathematicians-guide-mating – BCLC Jul 12 '15 at 13:00

3 Answers

13

From the Langford & Langford paper: suppose you have heard $m$ songs, of which $i$ were distinct, with $i<m$. The most likely size of the playlist (in the maximum likelihood sense) is $$\hat{N}=\left\lfloor \frac{1}{1-y^\star} \right\rfloor,$$ where $\lfloor \cdot \rfloor$ denotes the floor function and $y^\star$ denotes the smaller positive root of the polynomial $$y^m-iy+(i-1)=0$$

Note: if $i=m$, then all the songs you've heard so far have been distinct. Then there is no good way to estimate the size of the playlist; it might be infinite for all we know.

For convenience, here is a table of values for $\hat{N}$, all for $m=10$. For example, if you listen to $m=10$ songs and hear $i=7$ different ones among them, then the most likely size of the playlist is $\hat{N}=12$.

\begin{array}{|r|r|} \hline i & \hat{N}\\ \hline 1 & 1\\ 2 & 2\\ 3 & 3\\ 4 & 4\\ 5 & 5\\ 6 & 8\\ 7 & 12\\ 8 & 19\\ 9 & 42\\ \hline \end{array}
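Though the question asks for methods that work without a computer, a short root-finding routine makes the formula concrete. Below is a minimal Python sketch (my own code, not from the paper; the function name is made up). It uses the facts that $y=1$ is always a root of the polynomial and that the smaller positive root lies between $0$ and the polynomial's minimum at $y=(i/m)^{1/(m-1)}$, so simple bisection suffices:

```python
from math import floor

def playlist_mle(m, i):
    """Most likely playlist size after hearing m songs, i of them distinct.
    Returns None when i == m (no repeats, so no finite estimate)."""
    if i == m:
        return None        # all songs distinct: playlist could be arbitrarily large
    if i == 1:
        return 1           # only one song ever heard

    def f(y):
        return y**m - i*y + (i - 1)

    # f(0) = i - 1 > 0, and f is negative at its minimum (i/m)^(1/(m-1)),
    # so the smaller positive root lies in (lo, hi).
    lo, hi = 0.0, (i / m) ** (1.0 / (m - 1))
    for _ in range(100):   # bisection to high precision
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    y_star = (lo + hi) / 2
    return floor(1.0 / (1.0 - y_star))

# Reproduces the table above: [1, 2, 3, 4, 5, 8, 12, 19, 42]
print([playlist_mle(10, i) for i in range(1, 10)])
```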

vadim123
  • 83,937
  • I suppose the distribution of repeats, i.e. how many songs were heard twice, how many were heard thrice, and so on, does not give any additional information? –  Jul 05 '15 at 20:07
  • It might, but the L&L paper doesn't consider that information. – vadim123 Jul 05 '15 at 20:08
  • Just to make sure, for $i=9, m=10$ you are saying that it is most likely that there are $42$ songs on the playlist, and not that the expected value is $42$ songs? – MT_ Jul 05 '15 at 21:09
  • 1
    42 songs is (slightly) more likely than 41 or 43 or any other number. – vadim123 Jul 05 '15 at 21:20
  • 1
    The distribution of repeats shouldn't give additional information, because the selection of which song is repeated is a random process (whereas whether any other song is heard is not random, since it is dependent on the number of other songs left) – Yang Jul 06 '15 at 05:35
  • Perhaps of interest: simulation results for a third method included at the start of my Comment/Answer on simulation. – BruceET Jul 06 '15 at 20:08
  • How was the MLE derived? – BCLC Jul 12 '15 at 13:01
5

As @dsaxton suggests, one method is 'capture-recapture' (also called 'mark-recapture'). See the Wikipedia article for some technical details (if you can deal with the needlessly confusing notation). Here is the basic method with an intuitive argument for it.

Method. Listen to $c = 20$ distinct songs, noting their titles. Then listen to another $r = 20$ songs (also distinct among themselves), and count the $x$ repeats between the two groups of songs.

Estimation. Let $N$ be the total number of songs on the playlist. The fraction $x/r$ is the proportion of songs in the second group that also appeared in the first. The fraction $c/N$ is the proportion of songs on the whole playlist that are in the first group. Because the second group is a random sample from the playlist, the two proportions ought to be about the same: $x/r \approx c/N$, which leads to $N \approx cr/x.$ So if you had $x = 5$ repeats, you would estimate that there are $N = 400/5 = 80$ songs on the playlist.

Obviously, this method does not work if you have $x = 0$ repeats. Because $x = 0$ occurs with positive probability, the estimator $cr/x$ does not have a finite mean or variance. This difficulty is (technically) circumvented by adding $1$ to each of the quantities in the estimate: $N = (c+1)(r+1)/(x+1).$ Even so, the method works better for larger $x$ (and also for larger $c$ and $r$, if you have the patience).
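For the record, here is the corrected estimator as a tiny Python sketch (my own code; the function name is made up for illustration):

```python
def capture_recapture(c, r, x):
    """Estimate playlist size from c songs in the first group, r in the
    second, and x overlaps; the +1 corrections keep it finite when x == 0."""
    return (c + 1) * (r + 1) / (x + 1)

# With c = r = 20 and x = 5 as in the example above, this gives 73.5,
# close to the uncorrected estimate c*r/x = 400/5 = 80.
print(capture_recapture(20, 20, 5))
```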

Undoubtedly, the method could be improved by continuously monitoring repeats, but such strategies might be too messy to implement in the recreational setting of your question ("without a computer or calculator").

BruceET
  • 52,418
  • What do you do if your first 20 songs are not all different? – vadim123 Jul 05 '15 at 21:24
  • Good point. They are intended to be different. The capture-recapture method is hypergeometric, sampling without replacement, as in Wikipedia. (In many applications of the method $N$ is so large repeats within captured and recaptured groups are unlikely.) Editing to clarify. – BruceET Jul 06 '15 at 05:12
3

Comment on simulation results for a method based on time to first repeat.

Method: Let $Y$ be the wait (number of songs) until the first repeat. Then estimate the size $N$ of the playlist as $\hat N = Y(Y-1)/2.$ This certainly satisfies the OP's request for computational simplicity.

Some simulation results: Based on 100,000 iterations for each of ten values of $N$ from 10 through 200, it seems that this estimator is unbiased, $E(\hat N) = N$, but has a relatively large SD (almost as large as $N$). The distribution of $\hat N$ is strongly right-skewed.
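Here is a minimal Python sketch of that simulation (my own code, under the uniform-shuffle assumption), for anyone who wants to reproduce it:

```python
import random

def first_repeat_estimate(N):
    """Play uniformly random songs from a playlist of size N until one
    repeats, then return Y*(Y-1)/2, where Y is the number of songs heard."""
    heard = set()
    y = 0
    while True:
        y += 1
        song = random.randrange(N)
        if song in heard:
            return y * (y - 1) // 2
        heard.add(song)

# Rough check of the claimed unbiasedness for N = 50.
N, runs = 50, 100_000
estimates = [first_repeat_estimate(N) for _ in range(runs)]
mean = sum(estimates) / runs
sd = (sum((e - mean) ** 2 for e in estimates) / runs) ** 0.5
print(f"mean ~ {mean:.1f}, SD ~ {sd:.1f}")  # mean should land near N
```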

I believe I have seen this method before, but cannot immediately recall a rigorous rationale. Perhaps it can be extended to waiting for some small number $k > 1$ of repeats to give an unbiased estimator with a smaller SD.

Comment on limited simulation results for capture-recapture and Langford (inverse coupon collecting) methods.

It is not difficult to simulate the capture-recapture method to see how it performs in a particular instance. Also, the table provided by @vadim123 makes it easy to simulate the Langford method for that one case.

To make the two methods as comparable as possible, suppose the actual length of the playlist is 50.

In the capture-recapture method, suppose we look for matches between two samples of five songs (that is, $c = r = 5$). In 100,000 iterations we got an average estimated playlist size of 28 with an SD of 9.5. The mean plus two standard deviations does not even reach the actual value 50.
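For anyone who wants to replicate that run, here is a minimal Python sketch of the simulation (my own code, modeling both groups as draws of distinct songs, as in the capture-recapture answer above):

```python
import random

def simulate_capture_recapture(N=50, c=5, r=5, runs=100_000, seed=1):
    """Draw two groups of distinct songs (sizes c and r) from a playlist of
    size N, count overlaps x, and average the estimate (c+1)(r+1)/(x+1)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(runs):
        first = set(rng.sample(range(N), c))
        second = rng.sample(range(N), r)
        x = sum(1 for song in second if song in first)
        estimates.append((c + 1) * (r + 1) / (x + 1))
    mean = sum(estimates) / runs
    sd = (sum((e - mean) ** 2 for e in estimates) / runs) ** 0.5
    print(f"mean estimate ~ {mean:.1f}, SD ~ {sd:.1f}")

simulate_capture_recapture()  # should print roughly: mean ~ 28, SD ~ 9.5
```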

In the Langford method we count the unique outcomes among ten songs. In 100,000 iterations, about 38,000 produced ten unique songs (and hence no estimate). Among the iterations that did produce estimates, the average estimated playlist size was about 34, with an SD of about 11.5. Again this seriously underestimates the length of the playlist when we do get an estimate, and it gives no useful estimate over a third of the time.

It is fair to say that listening to only 10 songs does not give brilliant results with either method. My guess is that the Langford method works best when the number of songs observed is larger.

Addendum after some comments: A simulation with $N = 35$ (and the same sample sizes) gives a mean estimate of about 25 with SD 10 for capture-recapture; with the Langford method, a mean of about 31 with SD 13, and no estimate about a quarter of the time. The point remains that neither method seems to work very well for small numbers of observed songs. As far as I am concerned, the Question still awaits a simple and more useful answer, if such exists.

BruceET
  • 52,418
  • What is an "iteration"? It seems like you are imposing a distribution on the size of the playlist, by your construction of these iterations. – vadim123 Jul 05 '15 at 21:23
  • Yes. There is no such thing as a 'generic' simulation. All parameters must be known. As mentioned here, both simulations are based on $N = 50.$ The task of the simulation is to see how well estimates generated by the two methods detect this value. A more comprehensive simulation study would require considering various values of N (hence the word 'limited'). I repeated ('iterated') each method 100,000 times to get the simulated means and SDs reported. – BruceET Jul 06 '15 at 05:19
  • It is clear that you expect this underestimation because 9 songs only gives an estimate of $42$. To make a fairer test you need to have the total number of songs much less than the highest possible estimate of the methods. – user21820 Jul 06 '15 at 06:41
  • 1
    @user21820: Yes, and even worse for capture-recapture, where the maximum possible estimate (when x = 0) is 36. Being lazy, I wanted to use the table provided for the Langford method, so options were limited. The point is that neither method seems to work very well for small numbers of observed songs--and maybe someone has a better idea. – BruceET Jul 06 '15 at 07:07