
Let $s = \{x_1, x_2, \ldots, x_n\}$ be a multiset of $n$ non-negative integers with $\sum_i x_i = n$, and let $\{y_1, y_2, \ldots, y_{\sqrt{n}}\}$ be a subset of $s$ of size $\sqrt{n}$, chosen uniformly at random. Defining $y = \sum_i y_i$, I am interested in how large $y$ can be.

By linearity of expectation, I know $E[y] = \sum_i E[y_i] = \sqrt{n}$. But can I prove that, with high probability, $y$ is close to its mean?
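The expectation claim is easy to check numerically. Below is a minimal Monte Carlo sketch; the instance ($n = 100$, a mix of fives, ones, and zeros summing to $n$) is an arbitrary illustrative choice, not part of the question:

```python
import random

# Sanity check: E[y] = sqrt(n) when sampling sqrt(n) elements
# without replacement from a multiset summing to n.
n = 100
k = 10  # sqrt(n)
x = [5] * 10 + [1] * 50 + [0] * 40  # illustrative instance, sum(x) == n
assert sum(x) == n and len(x) == n

random.seed(0)
trials = 20000
# random.sample draws k distinct positions, i.e. without replacement
mean_y = sum(sum(random.sample(x, k)) for _ in range(trials)) / trials
print(mean_y)  # should be close to sqrt(n) = 10
```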

I tried using a Chernoff bound, but unfortunately, since the $x_i$'s, and therefore the $y_i$'s, are not independent, I can't apply it here.

I also tried Chebyshev's inequality, since the $y_i$'s seem to be negatively correlated, but I can't calculate the variance of the $y_i$'s, and the proof would be messy even if I could.

Does anyone have any idea for a simpler proof?

Soheil



You can bound the variance as follows. Let $b_i$ be an indicator variable for the event that $x_i$ was chosen. You are interested in $y = \sum_i b_i x_i$. We have
$$ \mathbb{E}[(b_ix_i)^2] = \frac{x_i^2}{\sqrt{n}}, \quad \mathbb{E}[b_ix_i\cdot b_jx_j] = \frac{\sqrt{n}(\sqrt{n}-1)}{n(n-1)}x_ix_j = (1-O(1/\sqrt{n}))\frac{x_ix_j}{n}. $$
Therefore, using $\sum_{i \neq j} x_ix_j \leq (x_1+\cdots+x_n)^2$,
$$ \mathbb{E}[y^2]=\sum_i \frac{x_i^2}{\sqrt{n}}+\frac{1-O(1/\sqrt{n})}{n}\sum_{i \neq j} x_ix_j \leq \frac{(x_1+\cdots+x_n)^2}{n} + \frac{\sum_i x_i^2}{\sqrt{n}}. $$
Since $\sum_i x_i = n$ and $\mathbb{E}[y] = \sqrt{n}$, this shows that
$$ \mathbb{V}[y] \leq n + \frac{\sum_i x_i^2}{\sqrt{n}} - (\sqrt{n})^2 = \frac{\sum_i x_i^2}{\sqrt{n}}. $$
The error in this estimate is quite small. The quantity $\sum_i x_i^2$ is maximized when $x_i = n$ for some $i$, and then we only conclude that $\mathbb{V}[y] \leq n^{3/2}$. This is of course not very helpful if you are trying to use Chebyshev's inequality!
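The variance bound above can be checked empirically. This sketch reuses an arbitrary illustrative instance (not from the question) and compares the empirical variance of $y$ against $\sum_i x_i^2 / \sqrt{n}$:

```python
import random

# Compare the empirical variance of y with the claimed bound
# Var[y] <= (sum_i x_i^2) / sqrt(n).
n = 100
k = 10  # sqrt(n)
x = [5] * 10 + [1] * 50 + [0] * 40  # illustrative instance, sum(x) == n

random.seed(1)
trials = 20000
samples = [sum(random.sample(x, k)) for _ in range(trials)]
mean = sum(samples) / trials
emp_var = sum((s - mean) ** 2 for s in samples) / trials

bound = sum(v * v for v in x) / k  # sum x_i^2 / sqrt(n) = 300 / 10 = 30
print(emp_var, bound)  # empirical variance should sit below the bound
```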

A cheap way out is to "filter out" all elements which are at least, say, $n^{1/2+\epsilon}$. Since $\sum_i x_i = n$, there are at most $n^{1/2-\epsilon}$ of these, and so by a union bound the sample of size $\sqrt{n}$ contains one of them with probability at most $\sqrt{n} \cdot n^{1/2-\epsilon}/n = n^{-\epsilon}$. This allows us to essentially assume that all $x_i$ are at most $n^{1/2+\epsilon}$, and so $$\sum_i x_i^2 \leq \left(\sum_i x_i\right) \max_i x_i \leq n^{3/2+\epsilon}.$$ This shows that the "effective" variance is at most $n^{1+\epsilon}$, and so with high probability the deviation from the mean is at most, say, $n^{1/2+2\epsilon}$.
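The filtering step can also be simulated. The sketch below uses an illustrative instance with a single "heavy" element (parameters $n = 10000$, $\epsilon = 0.1$ are my own choices) and estimates how often a sample of size $\sqrt{n}$ hits an element of size at least $n^{1/2+\epsilon}$:

```python
import random

# How often does a sample of size sqrt(n) contain a "heavy" element,
# i.e. one of size >= n^(1/2 + eps)?  With a single heavy element the
# probability is exactly k/n, matching the n^(-eps)-style bound.
n = 10000
k = 100  # sqrt(n)
eps = 0.1
threshold = n ** (0.5 + eps)  # about 251

# Illustrative instance: one heavy element, the rest small; sum(x) == n.
x = [5000] + [1] * 5000 + [0] * 4999
assert sum(x) == n and len(x) == n

random.seed(2)
trials = 2000
hits = sum(
    any(v >= threshold for v in random.sample(x, k)) for _ in range(trials)
)
frac = hits / trials
print(frac)  # should be near k/n = 0.01, i.e. the sample rarely sees it
```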

You can probably get better bounds with improved technology, but this bound is already pretty reasonable: when $x_1 = n$, the value of $y$ is always at distance at least $\sqrt{n}$ from the mean, so a deviation of order $\sqrt{n}$ is unavoidable.

Yuval Filmus