I am reading through ESL and came across equation (3.6), where the variance of the parameter estimates is given as $$Var(\hat{\beta}) = (X^TX)^{-1}\sigma^2$$ I can follow the mathematics through which this equation is obtained, but I am trying to make sense of what it actually says about the variance. In that regard, I have two outstanding questions (to make them concrete, I have included a few small NumPy sketches of my own after the list):
- What does the transformation $(X^TX)^{-1}$ represent? The matrix $X^TX$ contains along its diagonal the sum of squares of each feature over all data points, and its off-diagonal entries are the sums of cross-products between pairs of different features (the second sketch below checks this). What kind of transformation in the vector space does the inverse of this matrix perform?
- $\sigma^2$ is the variance of the errors, which we estimate from the squared residuals $(y_i - \hat{y}_i)^2$. It acts as a scaling factor on the transformation matrix above. But when we estimate it from the sample using $$\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$ the denominator is $N-p-1$ rather than $N$. Taking a sample dataset, say 100 data points with 2 features, the denominator becomes 97 instead of 100. Granted, the difference shrinks as more data points are added, but I don't understand degrees of freedom well enough to really grasp the idea that the 2 features and the intercept restrict $p+1$ degrees of freedom (the third sketch below shows the effect numerically). Can anyone help me understand this better?
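To make the main equation concrete, here is the first sketch: a small Monte Carlo check I put together myself (not anything from the book), where $X$ is held fixed, $y$ is simulated many times, and the empirical covariance of the least-squares estimates is compared with $(X^TX)^{-1}\sigma^2$. All variable names are my own.

```python
# My own sanity check of Var(beta_hat) = (X^T X)^{-1} sigma^2:
# hold X fixed, simulate y = X beta + noise repeatedly, and compare the
# empirical covariance of the least-squares estimates with the formula.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 2
sigma = 1.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + 2 features
beta = np.array([1.0, 2.0, -0.5])

estimates = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=N)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))     # beta_hat
estimates = np.array(estimates)

print(np.cov(estimates, rowvar=False))      # empirical covariance of beta_hat
print(sigma**2 * np.linalg.inv(X.T @ X))    # (X^T X)^{-1} sigma^2 from the equation
```

The two printed matrices come out essentially equal, which is what pushed me to ask what the $(X^TX)^{-1}$ part is really doing geometrically.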
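The second sketch is just a check of what the entries of $X^TX$ hold (again my own code, purely for illustration):

```python
# Check that X^T X has sums of squares of each column on the diagonal and
# sums of cross-products of pairs of columns off the diagonal.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                               # 100 points, 3 columns
G = X.T @ X

print(np.allclose(np.diag(G), (X ** 2).sum(axis=0)))        # True: sums of squares
print(np.allclose(G[0, 1], (X[:, 0] * X[:, 1]).sum()))      # True: cross-product of columns 0 and 1
```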
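The third sketch is my own simulation of the $N-p-1$ point with exactly the numbers from my example (100 points, 2 features plus an intercept): dividing the residual sum of squares by 97 recovers $\sigma^2$ on average, while dividing by 100 comes out biased low.

```python
# Compare dividing the residual sum of squares by N - p - 1 = 97 versus N = 100.
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 2
sigma = 2.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + 2 features
beta = np.array([1.0, 2.0, -0.5])

unbiased, naive = [], []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=N)
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)       # residuals y - y_hat
    rss = resid @ resid
    unbiased.append(rss / (N - p - 1))                      # divide by 97
    naive.append(rss / N)                                   # divide by 100

print(np.mean(unbiased), np.mean(naive), sigma ** 2)        # ~4.00 vs ~3.88 vs true 4.00
```

So I can see numerically that the $N-p-1$ denominator is the one that works; what I am missing is the intuition for why fitting $p+1$ coefficients uses up exactly $p+1$ degrees of freedom.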
Thanks in advance!