
I am reading through ESL and came across equation (3.6), where the variance of the parameter estimates is given as $$Var(\hat{\beta}) = (X^TX)^{-1}\sigma^2$$ I can follow the mathematics through which this equation is obtained, but I am trying to make sense of what it represents in describing the variance. In that regard, I have two outstanding questions:

  1. What does the transformation $(X^TX)^{-1}$ represent? As I understand it, $X^TX$ contains along its diagonal the sum of squares of each feature over the data points, and in its off-diagonal elements the sum-products of pairs of different features. What kind of transformation of the vector space does the inverse of this matrix produce?
  2. $\sigma^2$ is the variance of the errors, i.e., $(y_i - \hat{y}_i)^2$ summed and normalized by the number of samples. It acts as a scaling factor on the transformation matrix above. But when we estimate it from the sample using the formula below, $$\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$ the denominator becomes $N-p-1$. Taking a sample dataset, say 100 data points with 2 features, the denominator becomes 97 instead of 100. Granted, the difference shrinks as you add more data points, but I don't understand degrees of freedom well enough to really grasp the idea that the 2 features and the bias term restrict $p+1$ degrees of freedom. Can anyone help me understand this better? (I sketch the quantities involved numerically after this list.)
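To make the setup concrete, here is a small numpy sketch of the quantities above on a toy dataset (the design, coefficients, and noise level are arbitrary choices of mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 2                                    # 100 data points, 2 features

# Design matrix with an intercept column prepended
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # least-squares fit
resid = y - X @ beta_hat

sigma2_hat = resid @ resid / (N - p - 1)         # denominator is 97 here
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)   # equation (3.6)
print(np.sqrt(np.diag(var_beta_hat)))            # standard errors of the coefficients
```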

Thanks in advance!

1 Answer


$X^TX$ is the information matrix that encodes the relationships between the predictors in your model:

  • Diagonal elements are the sums of squared values of each predictor (proportional to that predictor's variance when the columns are centered).
  • Off-diagonal elements are the sums of cross-products between pairs of predictors (proportional to their covariances).

The inverse, $(X^TX)^{-1}$, adjusts for any correlations between predictors. It tells us how sensitive the parameter estimates $\hat{\beta}$ are to changes in the input data. Specifically, scaled by $\sigma^2$ it gives the variance-covariance matrix of the parameter estimates, showing how much uncertainty we have about each estimate. Larger diagonal values in this inverse matrix correspond to higher uncertainty (variance) in the corresponding parameter.
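To see that $(X^TX)^{-1}\sigma^2$ really behaves as a covariance matrix, here is a minimal simulation sketch (numpy, with a design and noise level I made up): hold $X$ fixed, redraw the noise many times, and compare the empirical covariance of the fitted coefficients with the closed-form matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 200, 0.7
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -0.5])

theory = sigma**2 * np.linalg.inv(X.T @ X)       # Var(beta_hat) from equation (3.6)

fits = []
for _ in range(5000):                            # refit under a fresh noise draw each time
    y = X @ beta + rng.normal(scale=sigma, size=N)
    fits.append(np.linalg.solve(X.T @ X, X.T @ y))
empirical = np.cov(np.array(fits), rowvar=False) # covariance of beta_hat across refits

print(np.round(theory, 5))
print(np.round(empirical, 5))                    # should closely match `theory`
```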

The term $N-p-1$ in the denominator accounts for degrees of freedom:

  • $N$ is the total number of data points.
  • $p+1$ corresponds to the number of estimated parameters (including the intercept).

Each estimated parameter uses up one degree of freedom, leaving $N-p-1$ degrees of freedom for estimating the error variance $\sigma^2$. This adjustment prevents underestimating the true variance by correcting for the fact that we've fitted a model with parameters that reduce the independent information available in the data.
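A quick way to see the effect of this correction is to simulate. Under the same kind of setup as above (fixed design, Gaussian noise, true $\sigma^2 = 1$; the specific numbers are my own), dividing the residual sum of squares by $N$ is biased downward, while dividing by $N-p-1$ averages out to the truth:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, 2.0, -0.5])

biased, unbiased = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(size=N)            # true sigma^2 = 1
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    rss = resid @ resid
    biased.append(rss / N)                       # naive: divide by N
    unbiased.append(rss / (N - p - 1))           # divide by 97 here

print(np.mean(biased))                           # systematically below 1.0 (about 0.97)
print(np.mean(unbiased))                         # close to 1.0
```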

Robert Long