I am trying to prove the following expression for the mean squared error (MSE) of linear regression:
\begin{gather} \frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\frac{d}{n} \end{gather}
The variables are defined as follows:
$\mathbf{X} \in \mathbb{R}^{n \times d}$: feature matrix of full column rank, with $n$ samples in $d$ dimensions
$\mathbf{z} \in \mathbb{R}^{n}$: Gaussian noise vector distributed as $\mathcal{N}(0, \mathrm{diag}(\sigma^2, \dots, \sigma^2)) = \mathcal{N}(0, \sigma^2 \mathbf{I}_n)$
$\mathbf{y} \in \mathbb{R}^{n}$: noisy measurement of the true signal $\mathbf{y}^{*}$ with additive Gaussian noise, $\mathbf{y} = \mathbf{y}^{*} + \mathbf{z}$
The estimated weights are obtained by applying the pseudo-inverse of $\mathbf{X}$ to $\mathbf{y}$:
\begin{gather} \mathbf{\hat{w}} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{y}^{*} + \mathbf{z}) \end{gather}
With the true optimal weights denoted by
$$\mathbf{w}^{*} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}^{*}$$
The expectation is taken over the noise vector $\mathbf{z}$; all other variables are assumed to be deterministic (non-random).
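Plugging the definition of $\mathbf{w}^{*}$ into the expression for $\mathbf{\hat{w}}$, the term inside the norm reduces (if I am not mistaken) to
\begin{gather} \mathbf{\hat{w}} - \mathbf{w}^{*} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{y}^{*} + \mathbf{z}) - (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}^{*} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{z}, \\ \text{so that} \quad \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} = \mathbf{X}(\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{z}. \end{gather}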
So far I have tried all kinds of shenanigans with the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\intercal$, but could only arrive at
$$\frac{1}{n}\mathrm{diag}(\sigma^2, \dots, \sigma^2)$$
My main problem is figuring out how $d$ gets into the equation.
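As a sanity check that the claimed identity is even true, here is a quick Monte Carlo estimate of the left-hand side (a minimal NumPy sketch; the sizes $n = 200$, $d = 5$ and $\sigma = 0.7$ are arbitrary choices of mine), which can be compared against $\sigma^{2}\frac{d}{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary problem sizes for the check (my own choices, not from the question)
n, d, sigma = 200, 5, 0.7

X = rng.standard_normal((n, d))        # random X has full column rank almost surely
y_star = X @ rng.standard_normal(d)    # some fixed, noise-free signal

# Pseudo-inverse (X^T X)^{-1} X^T, as in the definitions above
X_pinv = np.linalg.pinv(X)
w_star = X_pinv @ y_star               # true optimal weights

trials = 20_000
mse = 0.0
for _ in range(trials):
    z = sigma * rng.standard_normal(n)     # z ~ N(0, sigma^2 I)
    w_hat = X_pinv @ (y_star + z)          # least-squares weights from the noisy y
    mse += np.sum((X @ (w_hat - w_star)) ** 2) / n
mse /= trials

print("Monte Carlo estimate of (1/n) E||X w_hat - X w*||^2:", mse)
print("Claimed value sigma^2 * d / n:                      ", sigma**2 * d / n)
```

With these settings the claimed value would be $\sigma^{2}\frac{d}{n} = 0.49 \cdot 5 / 200 \approx 0.0123$.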