
We are given a multinomial distribution with $k$ bins and $n$ balls. The number of balls is at most the number of bins and at least its square root, i.e., $\sqrt{k} \le n \le k$. The probabilities of throwing a ball into a specific bin are monotone non-increasing, i.e., $p_1 \ge p_2 \ge \dots \ge p_k$, but we may also assume they are all equal ($p_1 = p_2 = \dots = p_k$). Let $X_i$ be the random variable representing the number of balls in bin $i$. Let $X = \max\{X_1, \dots, X_k\}$ be the maximum load, and let $Y = \left|\{i \in [k] : X_i = X\}\right|$ be the number of bins attaining it. Are there any known upper bounds on $\mathbb{E}[Y]$?

Attempt. By the birthday paradox, it is easy to see that $\mathbb{E}[Y] = \Theta(\sqrt{k})$ when $n = \Theta(\sqrt{k})$. The case $n \ge k \cdot \text{polylog}(k)$ is also doable. I have not succeeded in the other cases so far. I found the paper on balls into bins by Raab and Steger (RANDOM 1998), which bounds the expected number of bins whose ball count exceeds a given threshold, but this is quite different from what I am looking for.
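
For reference, the $n = \Theta(\sqrt{k})$ case can be spelled out as follows, assuming uniform probabilities $p_i = 1/k$ (an assumption of this sketch; the event $D$ is notation introduced here). Let $D$ be the event that all $n$ balls land in distinct bins. On $D$ the maximum is $X = 1$ and $Y = n$, while $Y \le n$ always holds because only non-empty bins can attain the maximum. Hence

$$\Pr[D] = \prod_{i=1}^{n-1}\left(1 - \frac{i}{k}\right) \ge 1 - \frac{n(n-1)}{2k}, \qquad n \cdot \Pr[D] \le \mathbb{E}[Y] \le n.$$

For $n = c\sqrt{k}$ with constant $c$, the product tends to $e^{-c^2/2} > 0$ as $k \to \infty$, so both bounds are $\Theta(\sqrt{k})$.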

Interestingly, in experiments it seems there are always very few bins with the maximum number of balls, both for the case $n = k \log_2 k$ (first figure) and the case $n = \sqrt{k} \log_2(k)$ (second figure).

Figure 1

In the above figure I have set $k = 2^{10}$ and $n = k \log_2(k)$.

Figure 2

In the above figure I have set $k = 2^{10}$ and $n = \sqrt{k} \log_2(k)$.
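
An experiment of this kind can be sketched in a few lines of Python. This is a minimal Monte Carlo sketch, not the code behind the figures: it assumes uniform probabilities $p_i = 1/k$, and the function name, trial count, and seed are illustrative choices.

```python
import math
import random

def estimate_expected_ties(k, n, trials=200, seed=0):
    """Monte Carlo estimate of E[Y] for n balls thrown uniformly into k bins.

    Y is the number of bins whose load equals the maximum load.
    Uniform bin probabilities are an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * k
        for _ in range(n):
            counts[rng.randrange(k)] += 1
        # Number of bins that attain the maximum load in this trial.
        total += counts.count(max(counts))
    return total / trials

if __name__ == "__main__":
    k = 2 ** 10
    log2_k = round(math.log2(k))
    # The two regimes from the figures: n = k log2(k) and n = sqrt(k) log2(k).
    for n in (k * log2_k, math.isqrt(k) * log2_k):
        print(n, estimate_expected_ties(k, n))
```

Non-uniform, monotone $p_i$ could be handled by replacing `rng.randrange(k)` with a weighted draw such as `rng.choices(range(k), weights=p)[0]`.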

  • Very interesting. Have you seen a version of this question in any other research? I wonder what the answer looks like. Also, could you elaborate slightly on the proofs and answers in the cases $n = \Theta(\sqrt{k})$ and $n \geq k \cdot \text{polylog}(k)$? – Sarvesh Ravichandran Iyer Dec 08 '24 at 09:50
  • As for the case $n \ge k \cdot \text{polylog}(k)$, the paper I linked by Raab and Steger (RANDOM 1998) gives a satisfying answer. In one of its proofs, it bounds the number of bins that exceed a certain threshold by $O(\text{polylog}(k))$. This estimate is good enough for me (I am looking for results up to $\text{polylog}$ factors).

    By contrast, when $n = \Theta(\sqrt{k})$, the birthday paradox says that with constant probability the $n$ birthdays fall on $n$ different days. In that event every occupied bin holds exactly one ball and so attains the maximum, hence the answer is $\Theta(\sqrt{k})$.

    – CuriousGuy Dec 09 '24 at 10:08
  • Thanks for the clarifications. I will see what the intermediate regions look like. – Sarvesh Ravichandran Iyer Dec 09 '24 at 10:21
  • I expect there should be some kind of "interpolation" between the two extreme cases – CuriousGuy Dec 09 '24 at 10:26
  • Exactly, but there could be a phase transition also, a jump in behaviour of the expectation with a much shorter change in the ball asymptotics. – Sarvesh Ravichandran Iyer Dec 09 '24 at 10:31

2 Answers


Too long for a comment.

I tried to look empirically at the $k=n$ case so with the same number of balls and bins. The pattern of the expected number of bins with the maximum number of balls was not what I expected. I took three approaches.

  • The first was to do exact calculations. A brute force count of the $k^n$ possibilities was practical up to $k=n=7$ and produced results such as $1$ for $k=n=1$, $\frac{6}{4}=1.5$ for $k=n=2$, $\frac{39}{27}\approx 1.44$ for $k=n=3$, $\frac{364}{256}\approx 1.42$ for $k=n=4$, $\frac{4505}{3125}\approx 1.44$ for $k=n=5$, $\frac{70356}{46656}\approx 1.51$ for $k=n=6$, and $\frac{1309483}{823543}\approx 1.59$ for $k=n=7$. So it goes up, then down, and then up again. An alternative exact calculation over the partitions of $n$ is possible up to higher values, and the chart below shows this up to $k=n=50$ with black circles; it might be possible to go a little further, but soon there will be too many partitions.

  • The second was to do simulations. I took $10^5$ samples in each case up to $k=n=2^{10}=1024$ before I got bored of waiting for results and these are shown with light blue + in the chart; these were close to the exact results where I could compare them. The pattern for the expectations going up then down then up again seems to repeat for larger $k=n$ too.

  • The third was to attempt a crude approximation, assuming each bin expects to receive $\lambda=\frac{n}{k} =1$ ball here; the count in a bin actually has a $\operatorname{Bin}(n,\frac1k)$ distribution but I assumed a $\operatorname{Poisson}(\lambda)$ distribution, which will be close for large $k=n$ though not when they are small. I also assumed the number of balls in each bin was independent of the other bins, which is clearly very wrong when there are few balls or bins (the total number of balls should be a constraint) but not so wrong when there are many of each. This approximation is shown as the red line up to $k=n=2^{14}= 16384$ and looks like a reasonable approximation to the simulations from about $50$ upwards. The up and down pattern appears to continue.
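
The brute-force count in the first bullet can be sketched as follows, assuming uniform bins. The function name is illustrative, and this is not necessarily the code used for the chart. It enumerates all $k^n$ assignments, so it is only practical for very small $k=n$; exact rationals come from `fractions.Fraction`.

```python
from fractions import Fraction
from itertools import product

def exact_expected_ties(k, n):
    """Exact E[Y] for n balls in k equally likely bins, by full enumeration.

    Each of the k**n assignments of balls to bins is equally likely;
    Y counts the bins whose load equals the maximum load.
    """
    total = 0
    for assignment in product(range(k), repeat=n):
        counts = [0] * k
        for b in assignment:
            counts[b] += 1
        total += counts.count(max(counts))
    return Fraction(total, k ** n)

if __name__ == "__main__":
    for size in range(1, 6):
        value = exact_expected_ties(size, size)
        print(size, value, float(value))
```

`Fraction` reduces automatically, so for example $\frac{39}{27}$ is reported as $\frac{13}{9}$.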

My rationalisation for the up and down pattern is that having more balls increases the opportunity for several bins to share a particular maximum number (up), but it also increases the opportunity for one of those bins to then receive one more ball and so be alone at the top (down), and these two effects dominate at different times.

The chart itself is drawn on a log-scale for $k=n$ so both ends of the pattern shown are clear. The values shown up to $k=n=2^{14} = 16384$ suggest that the expectation may grow as $O(\log n)$, and possibly more slowly, but of course this proves neither.
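
The independent-Poisson approximation from the third bullet above has a short closed form. If the loads $Z_1,\dots,Z_k$ are treated as i.i.d. with pmf $p(m)$ and cdf $F(m)$, then by symmetry $\mathbb{E}[Y]=k\sum_{m\ge 0}p(m)F(m)^{k-1}$, since a given bin attains the maximum at value $m$ exactly when it holds $m$ balls and the other $k-1$ bins hold at most $m$. The following is a minimal sketch of that formula, not necessarily how the red line in the chart was produced; the function name, the `max_load` cutoff, and the symbols $Z_i$, $p$, $F$ are assumptions of this sketch.

```python
import math

def poisson_approx_expected_ties(k, lam=1.0, max_load=200):
    """Approximate E[Y] treating the k bin loads as i.i.d. Poisson(lam).

    Uses E[Y] = k * sum_m p(m) * F(m)**(k - 1). The sum is truncated at
    max_load (an assumed cutoff, ample for lam near 1). The m = 0 term
    covers the event that every bin is empty, which is negligible for
    lam = 1 and large k.
    """
    pmf = math.exp(-lam)  # p(0)
    cdf = pmf             # F(0)
    total = 0.0
    for m in range(max_load + 1):
        total += k * pmf * cdf ** (k - 1)
        pmf *= lam / (m + 1)       # p(m + 1) from p(m)
        cdf = min(1.0, cdf + pmf)  # F(m + 1), clamped against rounding
    return total

if __name__ == "__main__":
    for exponent in range(1, 15):
        k = 2 ** exponent
        print(k, poisson_approx_expected_ties(k))
```

Because the total number of balls is not fixed under this model, the values are only expected to be close to the true multinomial expectation when $k=n$ is large.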

Chart of Expected number of bins with maximum number of balls

Henry

I would say that in general $\mathbb{E}[Y] \approx 1$.

I suppose that each ball is thrown into a bin uniformly at random.

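Let us introduce, for a fixed $m$, the indicator variable (not to be confused with the ball count $X_i$ in the question):

$$X_i=\begin{cases}1, & \text{if the $i^{\text{th}}$ bin contains exactly $m$ balls,}\\ 0, & \text{otherwise.}\end{cases}$$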

Consider the sum: $$S_{k}=\sum_{i=1}^{k}X_i$$

$S_{k}$ is then the (random) number of bins that contain exactly $m$ balls.

The probability that the $i^{\rm th}$ bin contains exactly $m$ balls is

$$p_{m}=\binom{n}{m}\left ( \frac{1}{k} \right )^{m}\left (1- \frac{1}{k} \right )^{n-m}$$

The expectation:

$$E(X_i)=1\cdot p_{m}+0\cdot(1-p_{m})=p_{m}$$ Therefore

$$E(S_{k})=\sum_{i=1}^{k}E(X_i)=kp_{m}$$

or $$E(S_{k})=k\binom{n}{m}\left ( \frac{1}{k} \right )^{m}\left (1- \frac{1}{k} \right )^{n-m}$$ For simplicity, let us consider the asymptotic case where

$$n=\lambda k,\quad \lambda=\text{const},\quad k\to \infty$$

Then $$E(S_{k})\sim k \frac{\lambda^{m}}{m!}e^{-\lambda}$$ Consider a numerical example with $k=365$ and $n=k$, so that $\lambda=1$.

Numbers are rounded:

$$\begin{array}{c|c}
m & E(S_{k}) \\ \hline
0 & 134 \\
1 & 134 \\
2 & 67 \\
3 & 22 \\
4 & 6 \\
5 & 1 \\
6 & 0.18
\end{array}$$

We see that $E(S_{k})$ falls to about $1$ at $m=5$, so the expected maximum number of balls in any urn is about $5$, and with high probability only one urn contains this number of balls.
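
The table above can be reproduced with a short script. This is a minimal sketch of the two formulas for $E(S_{k})$ given earlier in this answer; the function names are illustrative and not part of the original answer.

```python
import math

def expected_bins_with_load(k, n, m):
    """Exact E(S_k): expected number of bins holding exactly m of n balls."""
    return k * math.comb(n, m) * (1 / k) ** m * (1 - 1 / k) ** (n - m)

def poisson_limit(k, lam, m):
    """Limiting form k * lam**m / m! * exp(-lam) for n = lam * k."""
    return k * lam ** m / math.factorial(m) * math.exp(-lam)

if __name__ == "__main__":
    k = n = 365
    for m in range(7):
        exact = expected_bins_with_load(k, n, m)
        limit = poisson_limit(k, n / k, m)
        print(m, round(exact, 2), round(limit, 2))
```

The first column of output is the binomial expression and the second is its Poisson limit; for $k=n=365$ the two agree closely.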

Martin Gales
  • How do you relate the random variables $S_k$ and $Y$? Also, stating that in general $\mathbb{E}[Y] \approx 1$ is false. It depends on $n$ and $k$, as shown in the question. If $n$ is small w.r.t. $k$, e.g., $n \ll \sqrt{k}$, then $\mathbb{E}[Y] = \Theta(n)$. – CuriousGuy Jan 07 '25 at 14:38
  • @CuriousGuy. Actually, I meant the case where the number of urns is comparable to the number of balls. The extremes you refer to are uninteresting. Say 1000 urns and 10 balls. Clearly the most probable outcome there is that 10 urns each hold the maximum of 1 ball. – Martin Gales Jan 07 '25 at 16:39
  • I don't agree at all. What happens when we have $k$ urns and $n$ balls with $\sqrt{k} \ll n \ll k$? This case is interesting and non-trivial, which is the whole point of my question. – CuriousGuy Jan 08 '25 at 17:12
  • @CuriousGuy It's too theoretical for me. The formula I came up with doesn't put any restrictions on balls and urns. I can use it to predict how many urns will, on average, contain a specific number of balls. But I commend you for taking the initiative to investigate a specific situation in depth. I'm more practical. – Martin Gales Jan 08 '25 at 18:05
  • Simulation suggests that with $365$ bins and $365$ balls the expected number of bins with the maximum number of balls is about $2.6$, while with $365$ bins and $20$ balls the expected number of bins with the maximum number of balls is about $12.3$. That figure is high because of the possibility that all the balls are in different bins. – Henry Feb 05 '25 at 02:22