
I am taking a course in Econometrics.

I need help understanding how we arrive at the formula for the standard error of the regression, $$\hat{\sigma}^2=\frac{\sum{e_i^2}}{n-k}.$$

I understand Bessel's correction, which removes the bias inherent in the sample variance; a proof is available at Bessels Correction Proof of Correctness.

I also found these related questions: Standard deviation of error in simple linear regression and How to derive the standard error of linear regression coefficient.

But I could not find a proof of the above expression (the standard error of the regression).

I tried to expand the expression along the lines of the Bessel's correction proof:

$$e_i=\text{Total SS}- \text{Explained SS}$$

Then I tried to expand the explained sum of squares term, but got stuck at

$$ \sum_{i=1}^n \operatorname{E}\left((\beta\mathbf{X}-\bar{y})^2\right) = \beta^2 \operatorname{E}(x^2)-2\beta\,\overline{xy}+\operatorname{E}(\bar{y}^2)$$

I don't know how to proceed. Can anyone please help?

Then I read this:

The term "standard error" is more often used in the context of a regression model, and you can find it as "the standard error of regression". It is the square root of the sum of squared residuals from the regression - divided sometimes by sample size n (and then it is the maximum likelihood estimator of the standard deviation of the error term), or by $n−k$ ($k$ being the number of regressors), and then it is the ordinary least squares (OLS) estimator of the standard deviation of the error term.

on Standard Error vs. Standard Deviation of Sample Mean
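
To make sure I understand that distinction, I wrote a small numerical sketch (simulated data in NumPy; here $k$ counts all estimated coefficients, intercept included, to match the formula above):

```python
import numpy as np

# Simulated toy data (my own example, not taken from any of the linked posts).
rng = np.random.default_rng(0)
n, sigma = 50, 2.0
x = rng.normal(size=n)
y = 1.0 + 3.0 * x + rng.normal(scale=sigma, size=n)

X = np.column_stack([np.ones(n), x])             # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
rss = resid @ resid                              # sum of squared residuals

k = X.shape[1]                                   # number of estimated coefficients
sigma2_mle = rss / n                             # divide by n: ML estimator (biased)
sigma2_ols = rss / (n - k)                       # divide by n - k: OLS estimator (unbiased)
print(sigma2_mle, sigma2_ols, sigma**2)
```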

Can anyone suggest a textbook where I can read about these derivations in more detail?

  • Where you have $e_i=\text{Total SS}- \text{Explained SS}$, what you need is $$ \sum_i e_i^2=\text{Total SS}- \text{Explained SS}. $$ – Michael Hardy Jun 21 '16 at 18:29

1 Answer


Here's one way. This will work only if you understand matrix algebra and the geometry of $n$-dimensional Euclidean space.

The model says $y_i = \alpha_0 + \sum_{\ell=1}^k \alpha_\ell x_{\ell i} + \varepsilon_i, \quad i=1,\ldots,n $ where

  • $y_i$ and $x_{\ell i}$ are observed;
  • The $\alpha$s are not observed and are to be estimated by least squares;
  • The $\alpha$s are not random, i.e. if a new sample with all new $x$s and $y$s is taken, the $\alpha$s will not change;
  • The $x$s are in effect treated as not random. This is justified by saying we're interested in the conditional distribution of the $y$s given the $x$s. The $y$s are random only because the $\varepsilon$s are;
  • The $\varepsilon$s are not observed. They have expected value $0$ and variance $\sigma^2$ and are uncorrelated. These assumptions are weaker than normality and independence.

The $n\times(k+1)$ "design matrix" is $$ X= \begin{bmatrix} 1 & x_{11} & \cdots & x_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{bmatrix} $$ with linearly independent columns and typically $n\gg k$.

The $(k+1)\times 1$ vector of coefficients to be estimated is $$ \alpha= \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix}. $$ The model can then be written as $Y= X\alpha+\varepsilon$, where $Y, \varepsilon \in\mathbb R^{n\times 1}$. Then $Y$ has expected value $X\alpha\in\mathbb R^{n\times 1}$ and variance $\sigma^2 I_n\in\mathbb R^{n\times n}$.
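
As a concrete instance of this setup, here is an assumed NumPy simulation (the particular numbers are arbitrary; only the shapes and the model structure matter):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 100, 2, 1.5

# Design matrix: a column of ones followed by the k observed regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # shape (n, k+1)

alpha = np.array([2.0, -1.0, 0.5])        # true, unobserved coefficients (hypothetical)
eps = rng.normal(scale=sigma, size=n)     # unobserved errors: mean 0, variance sigma^2
Y = X @ alpha + eps                       # the model Y = X alpha + eps

print(X.shape, Y.shape)                   # (100, 3) (100,)
```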

The "hat matrix" is $H = X(X^T X)^{-1} X^T$, an $n\times n$ matrix of rank $k+1$. The vector $\widehat Y = HY$ is the orthogonal projection of $Y$ onto the column space of $X$. It is also $\widehat Y=HY = X\widehat\alpha$, where $\widehat\alpha$ is the vector of least-squares estimates of the components of $\alpha$.

The residuals are $\widehat\varepsilon_i = e_i = Y_i-\widehat Y_i = Y_i-(\widehat\alpha_0 + \sum_{\ell=1}^k \widehat\alpha_\ell x_{\ell i})$. These are observable estimates of the unobservable errors. The vector of residuals is $$ \widehat\varepsilon = e = (I-H)Y. $$ This has expected value $(I-H)\operatorname{E}(Y) = (I-H)X\alpha = 0$.
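
Continuing with a simulation of the same form, the facts above about $H$, $\widehat Y$, and the residuals can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 100, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
alpha = np.array([2.0, -1.0, 0.5])
Y = X @ alpha + rng.normal(scale=sigma, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)            # hat matrix X (X^T X)^{-1} X^T
alpha_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = H @ Y                                    # orthogonal projection of Y onto col(X)
print(np.allclose(Y_hat, X @ alpha_hat))         # True: H Y = X alpha_hat
e = Y - Y_hat                                    # residual vector
print(np.allclose(e, (np.eye(n) - H) @ Y))       # True: e = (I - H) Y
print(np.allclose((np.eye(n) - H) @ (X @ alpha), 0))   # (I - H) X alpha = 0
```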

We seek \begin{align} & \operatorname{E}(\|\widehat\varepsilon\|^2) = \operatorname{E}(\|e\|^2) \\[10pt] = {} & \operatorname{E} ( \Big((I-H)Y\Big)^T \Big((I-H)Y\Big)) \\[10pt] = {} & \operatorname{E} (Y^T (I-H) Y) \qquad \text{since } (I-H)^T = I-H = (I-H)^2. \text{ (Check that.)} \end{align} We've projected $Y$ onto the $(n-(k+1))$-dimensional column space of $I-H$. The expected value of the projection is $0$.

I claim the variance of the projection is just $\sigma^2$ times the identity operator on that $(n-(k+1))$-dimensional space. The reason for that is that $I-H$ is itself the identity operator on that $(n-(k+1))$-dimensional space, which is the orthogonal complement of the column space of $X$.
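
These projection facts are easy to verify numerically as well: symmetry, idempotence, and that the trace of $I-H$ (which equals its rank, since it is a projection) is $n-(k+1)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H                               # projector onto the orthogonal complement of col(X)

print(np.allclose(M, M.T))                      # symmetric:   (I - H)^T = I - H
print(np.allclose(M, M @ M))                    # idempotent:  (I - H)^2 = I - H
print(np.isclose(np.trace(M), n - (k + 1)))     # trace = rank = n - (k+1)
```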

So it's as if we have a random vector $w$ in $(n-(k+1))$-dimensional space with expected value $0$ and variance $\sigma^2 I_{(n-(k+1))\times(n-(k+1))}$, and we're asking what $\operatorname{E}(\|w\|^2)$ is. And that is $\sigma^2(n-(k+1))$.

Hence the expected value of the sum of squares of residuals (which is the "unexplained" sum of squares) is $\sigma^2(n-(k+1))$, so dividing the residual sum of squares by $n-(k+1)$ gives an unbiased estimator of $\sigma^2$.
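
If you want to see this conclusion empirically, here is a small Monte Carlo sketch (again an assumed simulation, not part of the argument): the residual sum of squares, averaged over many replications of the errors, should come out near $\sigma^2(n-(k+1))$, so dividing by $n-(k+1)$ is unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma = 40, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
alpha = np.array([1.0, 2.0, -0.5])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)     # I - H

reps = 20000
rss = np.empty(reps)
for r in range(reps):
    Y = X @ alpha + rng.normal(scale=sigma, size=n)   # fresh errors each replication
    e = M @ Y                                         # residual vector
    rss[r] = e @ e                                    # sum of squared residuals

print(rss.mean(), sigma**2 * (n - (k + 1)))           # these should be close
print((rss / (n - (k + 1))).mean(), sigma**2)         # RSS/(n-(k+1)) is unbiased for sigma^2
```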

  • Suppose $n=3$ and $k=1$. Then, in the 3-dimensional column picture, we have $\vec1$, $\vec x$ and $\vec y$. We are projecting $\vec y$ onto the 2-dimensional plane of $\vec1$ and $\vec x$. But why is $n-(k+1)=1$? – W. Zhu May 13 '18 at 08:21
  • @W.Zhu : What am I missing? You're saying suppose $n=3$ and $k=1$ and then asking why $n-(k+1) = 1$? Last time I checked, if $k=1$ then $k+1=2$ and if $n=3$ then $n-2=1. \qquad$ – Michael Hardy May 13 '18 at 10:58
  • I was confused with $HY$. Let me rephrase the question. $HY$ is a projection of $Y$ on the $(k+1)$-dimensional subspace formed by the columns of $X$. But $(I-H)$ is an $n \times n$ matrix. How do we know that its column space has $n-(k+1)$ dimensions? – W. Zhu May 14 '18 at 15:03
  • @W.Zhu : You can show that by showing that the mapping $y\mapsto (I-H)y$ is the orthogonal projection onto the orthogonal complement of the column space of $X. \qquad$ – Michael Hardy May 14 '18 at 16:10