I have data points $(x_t,y_t)$ generated from $y_t = a + b x_t + \epsilon$, where $\epsilon$ is a Gaussian error term with zero mean and unknown variance. I want to estimate the coefficients $a$ and $b$, but there is some cost associated with generating more data points. So, how many data points do I need to get a "reasonable" estimate of the coefficients? Can we quantify what "reasonable" means?
- It depends on the variance of $\epsilon$. But if you can get two data points far enough apart that the variance of $\epsilon$ is negligible compared to the distance between them, your estimate should be good enough. I guess... – davcha Mar 24 '16 at 14:52
- @davcha: though having only two points gives absolutely no information about the variance of $\epsilon$, while three might give some clues. – Henry Mar 24 '16 at 15:35
1 Answer
Obviously, how big an $n$ is enough depends on your goals and criteria. A similar question is "How rich is rich enough?"
I will write the model as $Y_i = \alpha + \beta x_i + e_i,$ where intercept $\alpha$ and slope $\beta$ are unknown constants to be estimated and the $e_i$ are IID $Norm(0, \sigma_e),$ where $\sigma_e$ also has to be estimated.
You specifically mention wanting estimates $a = \hat \alpha$ of coefficient $\alpha$ and $b = \hat \beta$ of $\beta$ to be good, so I'll start there.
Slope. A 95% CI for $\beta$ is $$b \pm t^* s_{y|x}\sqrt{1/S_{xx}},$$ where $s_{y|x}$ estimates $\sigma_e,$ $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2,$ and $t^*$ cuts 2.5% from the upper tail of Student's t distribution with $df = n-2.$ Also, $s_{y|x}^2$ is the sum of squared residuals divided by $n-2.$ We say that the 'standard error' for estimating $\beta$ is $s_{y|x}\sqrt{1/S_{xx}}.$ Roughly speaking, $s_{y|x}$ tends to be small when the data points are well fit by the regression line.
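To make the formula concrete, here is a minimal Python sketch (my own illustration, not part of the original answer); the sample size, true coefficients, and noise level are all made-up assumptions chosen for the demo:

```python
# Simulated example (assumed data): fit y = alpha + beta*x + e by least
# squares and form the 95% CI  b +/- t* * s_{y|x} * sqrt(1/S_xx).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # true alpha=2, beta=0.5

x_bar = x.mean()
Sxx = np.sum((x - x_bar) ** 2)
b = np.sum((x - x_bar) * (y - y.mean())) / Sxx     # least-squares slope
a = y.mean() - b * x_bar                           # least-squares intercept

resid = y - (a + b * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))          # s_{y|x}
t_star = stats.t.ppf(0.975, df=n - 2)              # cuts 2.5% from upper tail

se_b = s / np.sqrt(Sxx)                            # standard error of b
print(f"b = {b:.3f}, 95% CI: ({b - t_star*se_b:.3f}, {b + t_star*se_b:.3f})")
```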
The regression line must pass through $(\bar x, \bar Y),$ the 'center of gravity' of the data cloud. For a given number of points, the more the $x_i$'s are spread out over the region of interest, the greater $S_{xx}$ will be and the smaller the margin of error for estimating the slope $\beta$ will be. Also, increasing the number $n$ of $x_i$'s increases $S_{xx}.$ (The sample variance of the $x_i$'s is $S_{xx}/(n-1).$ This is one instance in statistics where it is *good* to have high variability!)
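A quick numeric check of the point about spread (again my own sketch, with invented x-configurations): compare $S_{xx}$ and the resulting slope standard error $\sigma_e/\sqrt{S_{xx}}$ for bunched versus spread-out x's at the same $n$:

```python
# Same n, same sigma: only the spread of the x's changes. Wider spread
# gives larger S_xx, hence a smaller standard error for the slope.
import numpy as np

sigma = 1.0
x_tight = np.linspace(4.5, 5.5, 10)   # x's bunched near 5
x_wide = np.linspace(0.0, 10.0, 10)   # x's spread over the whole range

for name, x in [("tight", x_tight), ("wide", x_wide)]:
    Sxx = np.sum((x - x.mean()) ** 2)
    print(f"{name}: S_xx = {Sxx:8.2f}, SE(b) = {sigma / np.sqrt(Sxx):.3f}")
```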
Y-Intercept. Similarly, the standard error for estimating $\alpha$ is $s_{y|x}\sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}}.$ So the precision of the estimate is improved by making $n$ and $S_{xx}$ larger. More $x_i$'s and more spread out.
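The same kind of back-of-the-envelope computation works for the intercept; in this sketch the value of $s_{y|x}$ is simply assumed rather than estimated, and the two x-configurations are hypothetical:

```python
# Standard error of the intercept for an assumed s_{y|x}; shifting the
# x's so that x_bar is near 0 shrinks the x_bar^2 / S_xx term.
import numpy as np

n = 20
s = 1.0                                     # assumed value of s_{y|x}
for x in (np.linspace(0, 10, n), np.linspace(-5, 5, n)):
    x_bar = x.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    se_a = s * np.sqrt(1.0 / n + x_bar ** 2 / Sxx)
    print(f"x_bar = {x_bar:5.2f}: SE(a) = {se_a:.3f}")
```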
Prediction. Sometimes the main goal of doing a regression is to be able to predict the value of $Y_{n+1}$ corresponding to a new observation at $x_{n+1}.$ A 95% prediction interval is $$\hat Y_{n+1} \pm t^* s_{y|x}\sqrt{1 + \frac{1}{n} + \frac{(x_{n+1}-\bar x)^2}{S_{xx}}}.$$ The additional message here, based on the last term under the radical, is that prediction of a new Y-value is more precise if the new x-value is near the average of the $x_i$'s used to make the regression line.
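And a sketch of the prediction interval, reusing the same simulated setup as the slope example above (again an assumption-laden illustration, not the answer's own code); note how the interval widens as $x_{n+1}$ moves away from $\bar x$:

```python
# 95% prediction interval at a new x, using the same simulated data as
# the slope example; the interval is narrowest near x_bar.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

x_bar = x.mean()
Sxx = np.sum((x - x_bar) ** 2)
b = np.sum((x - x_bar) * (y - y.mean())) / Sxx
a = y.mean() - b * x_bar
s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
t_star = stats.t.ppf(0.975, df=n - 2)

for x_new in (5.0, 10.0):                  # near x_bar vs at the edge
    half = t_star * s * np.sqrt(1 + 1/n + (x_new - x_bar) ** 2 / Sxx)
    print(f"x_new = {x_new}: Y_hat = {a + b*x_new:.2f} +/- {half:.2f}")
```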
Summary. It is not just the number $n$ of points that matters, but how spread out they are and whether their x-values are centered near where the x-values of new points of interest may lie. Of course, all of this depends on whether a linear model can truly describe the connection between x and Y values. Expressions with cut-off points $t^*$ from Student's t distribution depend on having normally distributed errors.