
The Problem

The problem, I believe, can be stated in the following way: Given $K$ white balls, all without weight (one can say that the weight is $0$), and $N - K$ red balls with individual strictly positive weights, if $n \leq N$ balls are picked uniformly at random without replacement, what is the expected ratio between the sum of the $n$ selected balls' weights and the total sum of all $N-K$ red balls' weights?

Comment: without any knowledge of how the weights are distributed among the red balls, or of how many red balls there are compared to white, I suspect the question may be too general to answer precisely. Either way, I have not been able to come up with a precise algebraic formula for the solution.

My Thoughts & Efforts

In my efforts I have followed two trains of reasoning, neither of which, even when followed through, produced the sought-after solution. They are mentioned to show that my approach has been to find ways of getting a rough idea of the bounds of a solution. Where no reference is given, the calculations are mostly my own and as such have not been reviewed by anyone, so both of these paths might be, and likely are, subject to a high degree of miscalculation. I highly value any input.

The Heaviest Ball

(I) - Firstly, the heaviest ball is expected to be one of the first $\lceil \log N \rceil + 1$ visited elements. This I believe can be shown by letting $X_i$, for $1 \leq i \leq N$, be the indicator random variable with $X_i = 1$ representing the event of the heaviest ball being the $i$-th ball evaluated, and $X_i = 0$ otherwise, and setting $X := \sum_{i=1}^N X_i$. For any specific $i$, the expected value can be written as in (1). \begin{equation} \mathbb E[X_i] = 0 \times P[X_i=0] + 1 \times P[X_i=1] = P[X_i=1] \quad \quad (1) \end{equation} If we assume that all such events are mutually independent and that the balls are picked without replacement, then the probability $P[X_i=1]$ can be written as (2). \begin{equation}\label{fun:fun_ct} P[X_i=1] = \frac{1}{N+1-i} \quad \quad (2) \end{equation} The expected number of tries then follows by linearity of expectation over these random variables (the second sum below is the first one reindexed, so the two are equal). \begin{align*} \mathbb E[X] &= \sum_{i=1}^N P[X_i=1]\\ &= \sum_{i=1}^N \frac{1}{N+1-i} \\ &= \sum_{i=1}^N \frac{1}{i} \\ &\leq \log(N) + 1 \end{align*}
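The reindexing identity and the harmonic bound used in the last derivation are easy to check numerically. A minimal Python sketch (the values of $N$ are arbitrary choices for illustration):

```python
import math

def harmonic(n):
    """Exact n-th harmonic number H_n = sum_{i=1}^{n} 1/i."""
    return sum(1.0 / i for i in range(1, n + 1))

for N in (1, 10, 100, 10_000):
    H = harmonic(N)
    # reindexing: sum_{i=1}^{N} 1/(N+1-i) equals H_N
    assert abs(H - sum(1.0 / (N + 1 - i) for i in range(1, N + 1))) < 1e-9
    # harmonic bound: H_N <= log(N) + 1
    assert H <= math.log(N) + 1
    print(N, round(H, 4), round(math.log(N) + 1, 4))
```

The loop passes for every $N$ tried, consistent with the bound $\sum_{i=1}^{N} \frac{1}{i} \leq \log(N) + 1$ cited from [1].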

From the Wikipedia page on the harmonic series [1], the relation $\sum_{i=1}^{N} \frac{1}{i} \leq \log(N) + 1$ is stated, which leads me to believe that if $n \geq \log(N) + 1$, the sought expected weight ratio should be bounded by $\frac{1}{N-K}$. If this process is repeated for $r := \lceil \frac{n}{\log(N) + 1} \rceil$ rounds, where in each round $n$ balls are picked, then by linearity of expectation, should we not obtain a weight ratio bounded from above(?) by $\frac{r}{N-K}$?

The Heaviest m Balls

(II) - Secondly, with an arguably severe abuse of terminology, I say that the number of red balls among the $n$ picked follows the hypergeometric distribution, $Y \sim \mathrm{Hypergeometric}(N, N-K, n)$, and its expectation can therefore, following the example in [2], be expressed algebraically as $\mathbb E[Y] = \sigma n$ where $\sigma := \frac{N-K}{N}$.
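This mean can be verified against the exact hypergeometric probability mass function. A small Python check, where the values of $N$, $K$ and $n$ are hypothetical:

```python
from math import comb

def hypergeom_mean(N, R, n):
    """Exact mean of Y ~ Hypergeometric(N, R, n): the number of red
    balls drawn when n balls are sampled without replacement from N
    balls of which R are red."""
    return sum(k * comb(R, k) * comb(N - R, n - k) / comb(N, n)
               for k in range(max(0, n - (N - R)), min(n, R) + 1))

# hypothetical sizes: N balls total, K white, n drawn
N, K, n = 20, 12, 7
R = N - K                 # number of red balls
sigma = R / N
# the exact mean agrees with sigma * n = n(N-K)/N
print(hypergeom_mean(N, R, n), sigma * n)
```

The agreement is exact (up to floating point), matching the textbook formula $\mathbb E[Y] = n\frac{N-K}{N}$.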

If one re-colors the $m$ heaviest red balls green, with $m < n$, then by the same reasoning the expected number of green balls among the $n$ picked should be $\hat{\sigma} n$ where $\hat{\sigma} := \frac{m}{N}$. The expected ratio should, therefore, be bounded from below by \begin{equation} L := \frac{\sum_{j=1}^{\hat{\sigma} n} \alpha^{j} + \sum_{j=\hat{\sigma}n +1 }^{\sigma n} \beta^{j}} {\sum_{j=1}^{N-K}\alpha^{j}} \end{equation} where $\beta^j$ is the weight of the $j$-th lightest ball, and from above by \begin{equation} U_1 := \frac{\sum_{j=1}^{\sigma n} \alpha^{j}}{\sum_{j=1}^{N-K}\alpha^{j}} \end{equation} where $\alpha^j$ is the weight of the $j$-th heaviest ball.

Comment: for the upper bound we have here settled for the sequence of the heaviest weights, but if one could compute the expected index offset between $n\sigma$ and $n\hat{\sigma}$ and denote it $\mathbb E[I]$, then the upper bound could be tightened to \begin{equation} U_2 := \frac{\sum_{j=1}^{\hat{\sigma} n} \alpha^{j} + \sum_{j=\hat{\sigma}n + \mathbb E[I]}^{\sigma n + \mathbb E[I]} \alpha^{j}} {\sum_{j=1}^{N-K}\alpha^{j}} \end{equation} It might actually be the case that $\mathbb E[I] = \mathbb E[X]$, and if so, perhaps we could have \begin{equation} U_3 := \frac{\sum_{j=1}^{\hat{\sigma} n} \alpha^{j} + \sum_{j=1}^{\sigma n - \hat{\sigma}n} \alpha^{j(\log(N))}} {\sum_{j=1}^{N-K}\alpha^{j}} \end{equation}

Conclusion

Finally, my conclusion is that the expected ratio $\mathbb E[Z]$ must satisfy \begin{equation} L \leq \mathbb E[Z] \leq U_3 \leq U_2 \leq U_1 \end{equation} where $0 \leq Z \leq 1$ is a continuous random variable expressing the ratio in question.

  • "what is the expected weight ratio between the $k$ selected balls' weights and the total sum of all $N-K$ red balls' weights" Is that the question? I ask because 1: since the denominator is a constant, it seems simpler to ask about the numerator (the expected total weight); 2: if that's so, the problem looks simple, and I just don't get the point of the calculations about "the Heaviest Ball" and so on... Am I missing anything? – leonbloy Apr 10 '19 at 03:21
  • You are right that the denominator is not necessary, and in retrospect I agree that the problem would probably be clearer without it. If the problem then turns out to be easy, no one is happier than me :D.

    For me as a novice, there are mainly two hurdles. The first is that nothing can be assumed about the weight distribution. The only thing I know is that the red balls can be ordered from heaviest to lightest; one can assume uniqueness here, so as not to have to come up with a way to break ties, but the end result would be the same.

    – Johannes Ringmark Apr 10 '19 at 10:20
  • It has intuitively felt as if the weight distribution should affect the expected sum, but in reality this might not be true.

 The ’Heaviest Ball’ idea asserts, and aims to show, that the expected sum (and consequently the ratio) increases linearly if the sample sizes are computed as $r$. For this to work, I have relied on linearity of expectation combined with the assertion that the result would be the same whether or not one samples with replacement. In short, if the heaviest ball is not removed, then each round would give the same sum, etc. – Johannes Ringmark Apr 10 '19 at 10:21
  • In practice, this linearity should, however, constitute a worst-case convergence profile, which occurs only when all weights are equal. In all other cases the weight collected in each later round will be less, thus indicating convergence of the weight.

    This possibility of convergence indicates that the ’Heaviest Ball’ formula is pessimistic and that one should be able to express the expected sum more precisely, regardless of whether the worst case also constitutes the expected case.

    – Johannes Ringmark Apr 10 '19 at 10:21

1 Answer


Perhaps I'm misunderstanding the problem, but it seems quite simple to me.

We don't even need to make the distinction between red and white (weightless) balls.

Let $\alpha_i$ ($i=1,2 \cdots N$) be the weights of the $N$ balls. We know (actually this will be irrelevant for us) that $\alpha_i>0$ if $i\le N-K$ (red balls) and $\alpha_i=0$ if $i> N-K$ (white balls).

We pick $n<N$ balls at random, uniformly, without replacement. Let $C_i\in \{0,1\}$, $i=1,\cdots N$, be an indicator random variable, taking value $1$ if ball number $i$ was picked, and $C_i=0$ otherwise. We are interested in the expected value of

$$X=\frac{\sum_{i=1}^N C_i \alpha_i}{\sum_{i=1}^N \alpha_i} \tag1$$

Then, because the denominator is a constant, and because expectation is linear: $$E[X]=\frac{\sum_{i=1}^N E[C_i] \, \alpha_i}{\sum_{i=1}^N \alpha_i} = \frac{\sum_{i=1}^N \frac{n}{N} \, \alpha_i}{\sum_{i=1}^N \alpha_i} = \frac{n}{N} \tag2$$
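This closed form can be verified exactly for small instances by enumerating all $\binom{N}{n}$ equally likely subsets. A minimal Python sketch, with hypothetical weights (including zero-weight white balls):

```python
from itertools import combinations

def expected_ratio(weights, n):
    """Exact E[X]: the average of (picked weight)/(total weight)
    over all equally likely size-n subsets of the balls."""
    N = len(weights)
    total = sum(weights)
    subsets = list(combinations(range(N), n))
    return sum(sum(weights[i] for i in s) for s in subsets) / (total * len(subsets))

# hypothetical weights; zeros play the role of the white balls
weights = [5.0, 1.5, 0.0, 0.0, 7.25, 0.1]
N, n = len(weights), 4
# the two printed values coincide: E[X] = n/N, whatever the weights
print(expected_ratio(weights, n), n / N)
```

The agreement follows from symmetry: each ball appears in $\binom{N-1}{n-1}$ of the $\binom{N}{n}$ subsets, so $E[C_i] = n/N$ and the weights cancel out of the ratio.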

leonbloy
  • It seems it really was this simple; this approach seems to solve it XD - nice, very good news.

    It feels very strange that you don't have to consider the actual values of the weights. Did I understand correctly when I inferred from your answer that the weight distribution does not matter? If that is the case, what changes to the problem formulation would cause the weights to affect the expected ratio?

    – Johannes Ringmark Apr 10 '19 at 16:03
  • Indeed, the weight distribution does not matter. It would matter, for example, for computing the variance of $X$... – leonbloy Apr 10 '19 at 19:08
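This last remark can also be checked exactly for small instances: for the hypothetical weight sets below, the mean of $X$ is always $n/N$, while the variance depends on the weights (it is $0$ when all weights are equal, since then every size-$n$ subset has the same ratio).

```python
from itertools import combinations

def ratio_moments(weights, n):
    """Exact mean and variance of X = (picked weight)/(total weight)
    over all equally likely size-n subsets."""
    N, total = len(weights), sum(weights)
    ratios = [sum(weights[i] for i in s) / total
              for s in combinations(range(N), n)]
    mean = sum(ratios) / len(ratios)
    var = sum((r - mean) ** 2 for r in ratios) / len(ratios)
    return mean, var

n = 3
for w in ([1.0] * 6, [10.0, 1, 1, 1, 1, 0]):   # hypothetical weight sets
    mean, var = ratio_moments(w, n)
    # both means equal n/N = 0.5, but the variances differ
    print(w, round(mean, 4), round(var, 6))
```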