Let $n,m\in\mathbb{Z}$ with $0 \le 2m < n$. Let $X_1, \cdots, X_n$ be i.i.d. standard Gaussians and let $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote their order statistics (i.e., $\{X_1, X_2, \cdots, X_n\} = \{X_{(1)}, X_{(2)}, \cdots, X_{(n)}\}$). Define $$Y=\frac{X_{(m+1)} + X_{(m+2)} + \cdots + X_{(n-m)}}{n-2m}.$$ Clearly, by symmetry, $\mathbb{E}[Y]=0$. What can we say about $\mathsf{Var}[Y]=\mathbb{E}[Y^2]$?
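(For concreteness, a minimal sketch of how $Y$ is computed from a sample; the function name is my own.)

```python
import numpy as np

def trimmed_mean(x, m):
    """Average of the order statistics X_(m+1), ..., X_(n-m):
    sort the sample and discard the m smallest and m largest values."""
    n = len(x)
    assert 0 <= 2 * m < n
    xs = np.sort(x)
    return xs[m:n - m].mean()
```

For example, `trimmed_mean([0, 1, 2, 3, 100], 1)` discards $0$ and $100$ and returns $2$.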

A closed-form expression for $\mathbb{E}[Y^2]$ for all $n$ and $m$ is too much to ask for. Ideally, I want explicit and reasonably tight upper (and lower) bounds. I have the following conjecture I would like to prove.

Conjecture. $$\mathbb{E}[Y^2] \le \frac{1}{n-2m}$$

This is obviously true for $m=0$, and it also holds for $m=(n-1)/2$ (odd $n$, in which case $Y$ is the median). The intuition is that $Y$ is essentially the average of $n-2m$ standard Gaussians, except that the distribution has had its tails amputated, which should only reduce the variance.

More generally, I am looking for a bound of the form $$\exists a,b,c\in(0,\infty) ~~~~\forall n \ge b \cdot m \ge a ~~~~~~~~ \mathbb{E}[Y^2] \le \frac{1}{n}\left( 1 + c \cdot \frac{m}{n}\right)~~~~~~~~~~.$$ I want the constants $a,b,c$ to be explicit and as small as possible.

I'm interested in the asymptotics as $n,m \to \infty$ and $\frac{m}{n} \to 0$. For example, $m=\log n$ is a parameter regime that interests me.

The asymptotic answer as $n\to\infty$ while $m/n \to \alpha>0$ has been studied.

Below are some numerical results (based on the average of $10^6$ draws). The trimmed mean smoothly interpolates between the mean (variance $1/n$) and the median (asymptotic variance $\pi/(2n) \approx 1.57/n$). Clearly one could hope for a better bound than the conjecture.
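(A sketch of the kind of Monte Carlo estimate used for these numbers, assuming standard Gaussian samples as in the question; the function name, seed, and trial count are my own.)

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_var_trimmed_mean(n, m, trials=10**5):
    """Monte Carlo estimate of E[Y^2] for the trimmed mean of n
    standard Gaussians with m values discarded from each side."""
    # trials x n matrix of i.i.d. N(0,1) draws, each row sorted
    x = np.sort(rng.standard_normal((trials, n)), axis=1)
    # trimmed mean of each row
    y = x[:, m:n - m].mean(axis=1)
    # E[Y] = 0 by symmetry, so the mean square estimates Var[Y]
    return np.mean(y**2)
```

One can then compare, say, `mc_var_trimmed_mean(100, 10)` against the conjectured bound $1/(n-2m) = 1/80$ and the untrimmed value $1/n = 1/100$.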

[Plot: variance of the trimmed mean as a function of the trimming fraction]

The interpretation of this question is as follows. I have $n$ samples from a normal distribution and I discard the largest $m$ samples and the smallest $m$ samples as "outliers". I estimate the mean of the distribution using the remaining samples. This is apparently known as the truncated mean or trimmed mean.

The question is what is the mean squared error of this estimator? That is, how much does discarding supposed outliers hurt? (Without loss of generality, for the analysis, I can assume zero mean and unit variance, even though these values would be unknown in practice.)

Ideally I want a result that holds for all "nice" distributions. Nice could mean something like symmetric, continuous, light-tailed, and unimodal. However, any definition of nice should include the Gaussian.

Thomas Steinke
  • Does it have to be a Gaussian? For uniform or exponential distributions, an answer could probably be extracted from the Wikipedia article: https://en.wikipedia.org/wiki/Order_statistic#Probability_distributions_of_order_statistics – David E Speyer Aug 22 '18 at 15:43
  • @DavidESpeyer Ideally I want a result that holds for all "nice" distributions, but, however "nice" is defined, it should include Gaussian. (E.g., "nice" could be symmetric distributions satisfying some moment bounds.) – Thomas Steinke Aug 22 '18 at 16:56
  • In "The Asymptotic Distribution of the Trimmed Mean" you will find an asymptotic result; in "On the variance of the trimmed mean" you will find explicit formulas in terms of second moments of order statistics. – g g Aug 24 '18 at 15:12
  • @gg Thanks for the references. "Trimmed mean" is a helpful keyword! The first reference is useful in that it gives an approximation when $n\to\infty$ and the ratio $m/n$ is fixed. The problem with the second reference is that I don't know how to compute the pairwise moments of order statistics. – Thomas Steinke Aug 24 '18 at 20:57
  • I don't understand one thing. According to your definition of order statistics, does $X_{(1)}+X_{(2)}+X_{(3)}$ mean the sum of the three smallest random variables among all of them? – Mostafa Ayaz Aug 27 '18 at 16:05
