I am trying to find a proof of the following identity for the MSE of linear regression:

\begin{gather} \frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\frac{d}{n} \end{gather}

The variables are defined as follows:

  • $\mathbf{X} \in \mathbb{R}^{n \times d}$: feature matrix of full column rank, containing $n$ samples in $d$ dimensions

  • $\mathbf{z} \in \mathbb{R}^{n}$: Gaussian noise vector, $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \sigma^{2}\mathbf{I}_{n})$

  • $\mathbf{y} \in \mathbb{R}^{n}$: noisy measurement of a true signal $\mathbf{y}^{*}$ with additive Gaussian noise, $\mathbf{y} = \mathbf{y}^{*} + \mathbf{z}$

The estimated weights are obtained by applying the pseudo-inverse of $\mathbf{X}$ to the above $\mathbf{y}$:

\begin{gather} \mathbf{\hat{w}} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{y}^{*} + \mathbf{z}) \end{gather}

With the true optimal weights denoted by

$$\mathbf{w}^{*} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}^{*}$$

The expectation is taken over the noise vector $\mathbf{z}$; all other variables are assumed deterministic (non-random).

So far I have tried all kinds of shenanigans with the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\intercal$, but could only arrive at

$$\frac{1}{n}\mathrm{diag}(\sigma^2, \dots, \sigma^2)$$

My main problem is figuring out how $d$ gets into the equation.
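
For what it's worth, a quick Monte Carlo check in NumPy agrees with the claimed right-hand side. This is only a sanity-check sketch: the sizes $n$, $d$, $\sigma$ and the trial count are arbitrary choices, and `np.linalg.pinv` stands in for $(\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal$ since $\mathbf{X}$ has full column rank.

```python
import numpy as np

# Monte Carlo sanity check of (1/n) E[ ||X w_hat - X w*||^2 ] = sigma^2 * d / n.
# n, d, sigma and the number of trials are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n, d, sigma, trials = 200, 5, 0.7, 20_000

X = rng.standard_normal((n, d))          # has full column rank almost surely
w_true = rng.standard_normal(d)
y_star = X @ w_true                      # noiseless signal

X_pinv = np.linalg.pinv(X)               # equals (X^T X)^{-1} X^T for full column rank X
w_star = X_pinv @ y_star                 # true optimal weights (= w_true here)

mse = 0.0
for _ in range(trials):
    z = sigma * rng.standard_normal(n)   # z ~ N(0, sigma^2 I_n)
    w_hat = X_pinv @ (y_star + z)        # least-squares fit on the noisy y
    mse += np.sum((X @ w_hat - X @ w_star) ** 2) / n
mse /= trials

print(mse, sigma**2 * d / n)             # the two numbers should nearly agree
```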

Nero

1 Answer

First, note that $w^* = (X^\top X)^{-1}X^\top y^*$ is generally incorrect, because $X^\top X$ might be singular. Instead, it is expressed in terms of the pseudoinverse $w^* = X^+y^*$.
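
For instance, here is a minimal NumPy sketch with a made-up rank-deficient $X$, where the normal-equation formula breaks down but the pseudoinverse still yields the minimum-norm least-squares solution:

```python
import numpy as np

# For a rank-deficient X, X^T X is singular, so the normal-equation formula
# (X^T X)^{-1} X^T y breaks down, while the pseudoinverse is still defined.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # second column = 2 * first
y = np.array([1.0, 2.0, 3.0])

# np.linalg.inv(X.T @ X) would raise LinAlgError here (singular matrix).
w = np.linalg.pinv(X) @ y             # minimum-norm least-squares solution
print(w, np.linalg.matrix_rank(X))    # rank is 1
```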

Since $\hat{w} = X^+(y^* + z)$ and $w^* = X^+y^*$, we have $X\hat{w} - Xw^* = XX^+z$, where $XX^+$ is the symmetric, idempotent orthogonal projector onto the column space of $X$. The solution then is to use the famous trace trick:

$$\begin{aligned} \tfrac{1}{n}\mathrm{E}\big[\|X\hat{w} - Xw^*\|_2^2\big] &= \tfrac{1}{n}\mathrm{E}\big[\|XX^+z\|_2^2\big] \\&= \tfrac{1}{n}\mathrm{E}\big[z^\top(XX^+)^\top XX^+z\big] \\&= \tfrac{1}{n}\mathrm{E}\big[z^\top XX^+z\big] \\&= \tfrac{1}{n}\mathrm{E}\big[\operatorname{tr}(z^\top XX^+z)\big] \\&= \tfrac{1}{n}\mathrm{E}\big[\operatorname{tr}(XX^+zz^\top)\big] \\&= \tfrac{1}{n}\operatorname{tr}\big(\mathrm{E}[XX^+zz^\top]\big) \\&= \tfrac{1}{n}\operatorname{tr}\big(XX^+\cdot\mathrm{E}[zz^\top]\big) \\&= \tfrac{1}{n}\operatorname{tr}\big(XX^+\cdot\sigma^2 I_n\big) \\&= \tfrac{1}{n}\sigma^2\operatorname{tr}(XX^+) \\&= \tfrac{1}{n}\sigma^2\operatorname{rank}(X) \end{aligned}$$

In particular, if $n \ge d$ and $X$ has full column rank, then $\operatorname{rank}(X) = d$, so

$$\tfrac{1}{n}\mathrm{E}\big[\|X\hat{w} - Xw^*\|_2^2\big] = \tfrac{d}{n}\sigma^2$$
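
The two facts the derivation rests on, namely that $XX^+$ is a symmetric idempotent projector and that $\operatorname{tr}(XX^+) = \operatorname{rank}(X)$, are also easy to check numerically. A minimal NumPy sketch (shapes chosen arbitrarily):

```python
import numpy as np

# Check that P = X X^+ is a symmetric idempotent projector and tr(P) = rank(X).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))   # has full column rank almost surely

P = X @ np.linalg.pinv(X)
assert np.allclose(P, P.T)          # symmetric
assert np.allclose(P @ P, P)        # idempotent: P^2 = P
print(np.trace(P), np.linalg.matrix_rank(X))  # both equal d = 5
```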

Hyperplane