
Consider a bivariate Gaussian distribution with parameters $\mu_1$ and $\mu_2$ for the two unknown means, and $\sigma_1$, $\sigma_2$, and $\rho$ for the known covariance matrix,

\begin{align} \Sigma&=\left(\begin{array}{cc} \sigma_1^2 & \sigma_{12} \\ \sigma_{12}& \sigma_2^2 \end{array}\right) = \left(\begin{array}{cc} \sigma_1^2 & \rho \sigma_{1} \sigma_{2} \\ \rho \sigma_{1} \sigma_{2}& \sigma_2^2 \end{array}\right) \end{align}

Assume we have $N$ i.i.d. samples, $X_1, \ldots, X_N$, each comprising the two components of the bivariate Gaussian.

An estimator is better than another if its variance is smaller; we consider only unbiased estimators. A pair of estimators for the means is best if the sum of their variances is the minimum achievable.

1) If the covariance matrix is unknown, the best we can do to estimate $\mu_1$ and $\mu_2$ is to let

$\hat{\mu}_1 = \sum_{i=1}^N X_{i,1} /N$ and $\hat{\mu}_2 = \sum_{i=1}^N X_{i,2} /N$, right?
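
For reference (this is just the standard variance of a sample mean), each of these estimators is unbiased with variance $\sigma_k^2/N$, so the sum of their variances is

\begin{equation} \operatorname{Var}(\hat{\mu}_1)+\operatorname{Var}(\hat{\mu}_2)=\frac{\sigma_1^2+\sigma_2^2}{N}. \end{equation}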

2) If the covariance matrix is known, is the best we can do still the same? Isn't there any way to use knowledge of the covariance matrix to improve the estimates of the means?

In particular, the trace of the inverse of the Fisher information matrix, which in this case equals $\Sigma/N$, is $(\sigma_1^2+\sigma_2^2)/N$, which suggests that it is impossible to use knowledge of the covariance matrix to improve the estimates.
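
To make that claim explicit: for $N$ i.i.d. observations with known $\Sigma$, the Fisher information matrix for the mean vector $(\mu_1,\mu_2)$ is $N\Sigma^{-1}$, so the Cramér–Rao bound on the sum of the variances of any pair of unbiased estimators is

\begin{equation} \operatorname{tr}\!\left( (N\Sigma^{-1})^{-1} \right) = \operatorname{tr}\!\left( \frac{\Sigma}{N} \right) = \frac{\sigma_1^2+\sigma_2^2}{N}, \end{equation}

which is exactly what the plain sample means achieve.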

This is very puzzling to me, though, especially in light of the fact that if we want to estimate a single mean (assuming that the other mean is known to be equal to $0$) from a bivariate Gaussian, we can leverage the correlation coefficient through the following estimator,

\begin{equation} \overline{\mu}_1 = \frac{\sum_{i=1}^N X_{i,1}}{N} - \rho \frac{\sigma_1}{\sigma_2}\frac{\sum_{i=1}^N X_{i,2}}{N} \end{equation} (see, e.g., page 4 of Sampling: regression methods on estimation).
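
Indeed, writing $\bar X_k = \sum_{i=1}^N X_{i,k}/N$ and using $\operatorname{Cov}(\bar X_1,\bar X_2)=\rho\sigma_1\sigma_2/N$, a direct computation gives

\begin{equation} \operatorname{Var}(\overline{\mu}_1) = \frac{\sigma_1^2}{N} + \rho^2\frac{\sigma_1^2}{\sigma_2^2}\cdot\frac{\sigma_2^2}{N} - 2\,\rho\frac{\sigma_1}{\sigma_2}\cdot\frac{\rho\sigma_1\sigma_2}{N} = \frac{\sigma_1^2\,(1-\rho^2)}{N}, \end{equation}

which is strictly smaller than $\sigma_1^2/N$ whenever $\rho \neq 0$.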

Why is the correlation coefficient so helpful when we want to estimate a single mean (assuming the other is known), yet useless when we want to estimate the two means (assuming neither of the means is known)? Is there any intuition behind this result?
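
For what it's worth, the single-mean variance reduction is easy to check numerically. Below is a minimal simulation sketch (NumPy; the parameter values and variable names are my own illustrative choices, not part of the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (my choice); mu_2 = 0 is assumed known by the regression estimator
sigma1, sigma2, rho = 1.0, 2.0, 0.8
mu = np.array([0.0, 0.0])
Sigma = np.array([[sigma1**2,           rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2          ]])
N, reps = 50, 20000

plain, regression = [], []
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=N)
    xbar1, xbar2 = X[:, 0].mean(), X[:, 1].mean()
    plain.append(xbar1)                                        # ordinary sample mean of the first coordinate
    regression.append(xbar1 - rho * sigma1 / sigma2 * xbar2)   # regression estimator, uses mu_2 = 0

print("Var(plain sample mean):   ", np.var(plain))        # should be near sigma1^2 / N
print("Var(regression estimator):", np.var(regression))   # should be near sigma1^2 * (1 - rho^2) / N
```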

This paper also inspired my question:

Wilks, S. S. (1932). Moments and distributions of estimates of population parameters from fragmentary samples. The Annals of Mathematical Statistics, 3(3), 163-195.

Daniel S.
  • You can take your formula for $\overline{\mu}_1$ and apply it analogously to $\overline{\mu}_2$. If this isn't somehow better than just taking the sample averages (and I don't know enough about Fisher information to contradict your argument that it isn't), then that would be because the sample means already take advantage of all the data points when you estimate the mean vector, whereas you ignore half the information when estimating a single coordinate. But in what sense do you intend the sample means to be the "best"? – Aaron Feb 02 '20 at 02:43
  • I'm assuming that an estimator of the mean is better than another if its variance is smaller. In the case of two estimators, to estimate the two means of the bivariate Gaussian, I'm assuming that the two estimators are "best" if the sum of their variances is the minimum possible. This is called A-optimality. Alternatively, D-optimality means that the entropy of the estimators is minimized. I think in this example the standard simple sample means are both A-optimal and D-optimal, and they do not use information about correlation. – Daniel S. Feb 02 '20 at 03:02
  • It is puzzling to me that correlation is useless. Note that the formula for $\overline{\mu}_1$ uses the fact that we know that $\mu_2=0$. If this is not the case, we cannot use it. In particular, we cannot use it to infer $\mu_2$. – Daniel S. Feb 02 '20 at 03:02

0 Answers