
In linear algebra we consider the problem of estimating overdetermined systems by using the normal equations. This can be applied when we want to find the polynomial that best fits some data. The latter seems to be exactly what OLS regression does.

In the statistical setting of model fitting it seems that we assume the data has a relationship $y=bx+a+e$, which is the same thing as the polynomial fitting in my view, aside from the error term $e$, which is assumed to be standard normal. The latter also leads us to think of the observations as random variables whose expected value is the linear polynomial.

What's the idea here? Do we simply start with the common polynomial-fitting idea and then add some noise to get a stochastic model that we then fit to our data? It sort of looks like we are just dressing the former idea up with a stochastic error term.

And why does one not use the solution to the normal equations instead of gradient methods? The assumptions and structure of the model become much clearer in the former case.
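
For concreteness, here is a rough sketch of what I mean (NumPy, with made-up data): the closed-form solution of the normal equations versus plain gradient descent on the same least-squares objective. Both should land on the same $\hat\beta$.

```python
import numpy as np

# Toy data: fit y ≈ a + b*x by least squares (illustrative values only).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=x.shape)

X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]

# Closed form: solve the normal equations X^T X beta = X^T y.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same least-squares objective.
beta_gd = np.zeros(2)
lr = 0.1
for _ in range(5000):
    grad = X.T @ (X @ beta_gd - y) / len(y)
    beta_gd -= lr * grad

print(beta_ne, beta_gd)   # both approximate the same minimizer
```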

  • The difference between linear and polynomial, say, of degree $2$ is that we assume that the data is $y=bx+a+\epsilon$ vs. $y=cx^2+bx+a+\epsilon$. For further details see Wikipedia. It is intuitively clear that polynomials of higher degree fit complicated $x\rightarrow y$ relationships better. In other words, the variance of the residual noise $\epsilon$ is smaller. That's all. – Kurt G. Dec 03 '22 at 11:26
  • This can be viewed as a geometric problem related to orthogonal projection of a point $y$ onto a linear space $\{Ax: x\in\mathbb{R}^n\}$. There are several postings on MSE about this. Here is one – Mittens Dec 05 '22 at 15:36

1 Answer


I've actually spent many sleepless nights thinking about this. Let me try to give some sort of an answer.

Suppose we assume the model $Y = X\beta + \epsilon$, where $X$ is an $N\times(p+1)$ data matrix whose entries we view as fixed, and $\epsilon \sim \mathcal N(0,\sigma^2 I)$. The question is how to interpret this model. Suppose you are an experimenter, and you want to study the relationship between the height of a plant and how much water it is given. You have $N$ plants, and you give the $j$th plant $x_j$ liters of water each day; after 30 days, you record the height $y_j$. If $y_j$ is well modeled by the normal distribution with mean $\beta_0+\beta_1x_j$ and variance $\sigma^2$, then this model is appropriate. Moreover, since you chose what $X$ is, it makes sense to treat $X$ as fixed.

Now the question is what exactly can we do with this assumption? First, we use the OLS estimator $\hat \beta = (X^TX)^{-1}X^Ty$, and from our assumptions, it follows that $\hat \beta \sim \mathcal N(\beta,\sigma^2(X^TX)^{-1})$, and it is common to take the estimator $\hat \sigma^2 = \frac{1}{N-p-1}\sum(y_j-\hat y_j)^2$.
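
As a concrete illustration, here is a minimal NumPy sketch that simulates the plant experiment above and computes $\hat\beta$ and $\hat\sigma^2$ exactly as defined (the watering amounts, true coefficients, and $\sigma$ are made up, not from any real experiment):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical experiment: N plants, watering amounts chosen by us (fixed X).
N = 50
water = rng.uniform(0.1, 0.3, size=N)          # liters per day
X = np.column_stack([np.ones(N), water])       # N x (p+1) design, here p = 1

beta_true, sigma = np.array([5.0, 20.0]), 2.0  # assumed "true" values
y = X @ beta_true + rng.normal(0, sigma, size=N)

# OLS estimator and the usual unbiased variance estimator.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - X.shape[1])  # divide by N - p - 1

print(beta_hat, sigma2_hat)
```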

There are two things we can do with this information: (1) we can make inferences about $\beta$, and (2) we can make predictions for future plant heights. Let's look at both.

To do inference, the most basic thing one can do is test $H_0: \beta_0 = 0$, $H_0: \beta_1 = 0$, or $H_0 : \beta_0=\beta_1 = 0$. Hopefully it is clear how to interpret these. We could also make confidence intervals for $\beta$.
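
For instance, here is a sketch of testing $H_0:\beta_1=0$ with a $t$-statistic and forming a 95% confidence interval for $\beta_1$, on the same kind of simulated data as above (SciPy is used only for the $t$ quantiles; all numbers are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 50
water = rng.uniform(0.1, 0.3, size=N)
X = np.column_stack([np.ones(N), water])
beta_true, sigma = np.array([5.0, 20.0]), 2.0
y = X @ beta_true + rng.normal(0, sigma, size=N)

# Fit as before.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
df = N - X.shape[1]                       # N - p - 1
sigma2_hat = resid @ resid / df

# Standard error of beta_1, t-statistic for H0: beta_1 = 0, and 95% CI.
se1 = np.sqrt(sigma2_hat * XtX_inv[1, 1])
t_stat = beta_hat[1] / se1
p_value = 2 * stats.t.sf(abs(t_stat), df)
t_crit = stats.t.ppf(0.975, df)
ci = (beta_hat[1] - t_crit * se1, beta_hat[1] + t_crit * se1)

print(t_stat, p_value, ci)
```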

To do prediction, suppose we were to give another plant $x$ liters of water every day. If $x$ is an entry of the second column of $X$, then we can clearly create a confidence interval for the true height after 30 days. What if $x$ is not an entry of the second column of $X$? Say the amounts of water we gave were in $\{0.1,0.2,0.3\}$ and $x \notin \{0.1,0.2,0.3\}$. Here, if $0.1\leq x\leq 0.3$, then it seems reasonable to assume that $y = \beta_0+\beta_1x+\epsilon$ still holds. If $x\gg 0.3$ or $x\ll 0.1$, then such a model is probably no longer reasonable, and prediction would not be appropriate.
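
A sketch of that prediction step under the same made-up setup, for a new watering amount $x=0.25$ inside the observed range: the usual intervals for the mean height at $x$ and for a single new plant come from $\hat y = \hat\beta_0+\hat\beta_1 x$ and $\hat\sigma^2$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 50
water = rng.uniform(0.1, 0.3, size=N)
X = np.column_stack([np.ones(N), water])
beta_true, sigma = np.array([5.0, 20.0]), 2.0
y = X @ beta_true + rng.normal(0, sigma, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
df = N - X.shape[1]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / df

# New plant watered x = 0.25 L/day: inside the observed range, so the linear
# model is assumed to still apply (far outside [0.1, 0.3] it might not).
x0 = np.array([1.0, 0.25])
y0_hat = x0 @ beta_hat
t_crit = stats.t.ppf(0.975, df)

# 95% interval for the mean height at x, and for a single new plant.
se_mean = np.sqrt(sigma2_hat * x0 @ XtX_inv @ x0)
se_pred = np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0))
print(y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
print(y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```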

Now, this model is for experimental data. But if we have observational data, then it is appropriate to assume that $(X,Y)$ come from some joint distribution (here I am using $X$ to denote both the random variable and the data matrix with a column of 1's; hopefully it is clear which one I'm referring to from context). We can always write $Y = \mathbb E(Y\mid X)+\epsilon$. Here we assume $\mathbb E(Y\mid X) = \beta_0+\beta^T X$. We also assume that $\epsilon \mid X \sim \mathcal N(0,\sigma^2)$, where $\sigma^2$ is not a function of $X$. Then, again, we use the OLS estimator $\hat \beta = (X^TX)^{-1}X^Ty$, and it follows that $\hat \beta\mid X \sim \mathcal N(\beta, \sigma^2(X^TX)^{-1})$. (Note that $X$ on the left means something different than $X$ on the right, sorry!) How do we interpret this statement? If you were to observe the exact same data matrix $X$ over and over again, the outcomes $y$ would be different each time, because not all of the variability in the outcome is explained by the covariates. Hence the estimate $\hat \beta$ would be different each time, and its distribution is the one given above. You can do inference and prediction as before, and the interpretation follows the same reasoning as just stated.
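
One way to see the "same $X$ over and over" interpretation is a quick simulation (made-up numbers again): fix one design matrix, redraw the noise many times, and compare the empirical covariance of $\hat\beta$ with $\sigma^2(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
x = rng.uniform(0.1, 0.3, size=N)
X = np.column_stack([np.ones(N), x])          # fixed design, reused below
beta_true, sigma = np.array([5.0, 20.0]), 2.0

XtX_inv = np.linalg.inv(X.T @ X)
betas = []
for _ in range(10_000):
    # Same X every time; only the noise (hence y) changes.
    y = X @ beta_true + rng.normal(0, sigma, size=N)
    betas.append(XtX_inv @ X.T @ y)
betas = np.array(betas)

# Empirical covariance of beta_hat vs. the theoretical sigma^2 (X^T X)^{-1}.
print(np.cov(betas.T))
print(sigma**2 * XtX_inv)
```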

Andrew