
I am really confused.

Assume we have a multiple regression model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u $$

Using R, we can test:

$$ H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \\ H_1: \beta_j \neq 0 \; \text{for at least one $j$}, $$ using summary(lm(Y~X1+X2+..+Xk, data = data)), which gives us the value of the F-statistic and its p-value. Everything is clear.

Also, we can perform a partial F-test, for example for the first coefficient, using anova(lm(Y~X2+..+Xk, data = data), lm(Y~X1+X2+..+Xk, data = data)). This is also clear.

I cannot understand what we get from the following:

anova(lm(Y~X1+X2+..+Xk, data = data)).

It gives us k lines with F values and p values which I cannot understand. I tried to read the documentation, and it says that this is sequential testing of coefficients (nested models). OK, but if I run the following:

anova(lm(Y~1, data = data), lm(Y~X1, data = data))

it does not give the same F and p values as the first line of anova(lm(Y~X1+X2+..+Xk, data = data)).

Could you please make this clear? Sorry if the question is stupid.

Mr.M

1 Answer


The question is not stupid at all. The R help files are not very useful on this topic, IMHO. The answer is somewhat hidden in the following quote from the help page for anova.lm():

Normally the F statistic is most appropriate, which compares the mean square for a row to the residual sum of squares for the largest model considered.

If something is "normally" most appropriate, this means that there are alternatives; in fact, there are two different F-statistics involved here.

Let me try to explain what is going on in terms of formulas. First, let's remind ourselves that for any null hypothesis test situation, there are usually different test statistics one could choose from. If $H_0$ is true, the test statistics will each be significant in exactly $\alpha\cdot 100\%$ of the samples (given any significance level $\alpha$), but with a given sample, they might give different results.
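(Not part of the original answer: a minimal R sketch of this general point, comparing a one-sample t-test and a Wilcoxon signed-rank test of the same $H_0: \mu = 0$. Both reject at roughly the nominal rate under $H_0$, but not on the same samples.)

set.seed(1)
reps  <- 5000
rej_t <- rej_w <- logical(reps)
for (i in 1:reps) {
  x <- rnorm(20)                             # H0: mu = 0 is true
  rej_t[i] <- t.test(x)$p.value < 0.05       # t-test decision
  rej_w[i] <- wilcox.test(x)$p.value < 0.05  # Wilcoxon decision
}
mean(rej_t)           # close to 0.05
mean(rej_w)           # close to 0.05 (exact test is slightly conservative)
mean(rej_t != rej_w)  # > 0: the two tests disagree on some samples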

Now back to the regression context. Suppose we have three nested models: $$ \text{model1}: \; y = \beta_0 + \beta_1 X_1 + u $$ $$ \text{model2}: \; y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u $$ $$ \text{model3}: \; y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + u $$

Let's assume $X_1$ consists of $k_1$ predictors, $X_2$ of $k_2$ predictors and $X_3$ of $k_3$ predictors (so each of $\beta_1, \beta_2, \beta_3$ is a coefficient vector). The error $u$ is normally distributed (and independent across observations) with (unknown) standard deviation $\sigma$.

Suppose that $\beta_2 = \beta_3=0$, i.e. all three models are correct.

Let's fit each model to a given dataset. Denote by $SSR_1$, $SSR_2$ and $SSR_3$ the sum of squared residuals from model1, model2 and model3, respectively. Using (quite involved) algebra on multivariate normal distributions, one can show that (given $\beta_2 = \beta_3 = 0$) the following two random variables both have an F-distribution (with different degrees of freedom): $$F_1 = \frac{(SSR_1-SSR_2)/k_2}{SSR_2/(n-(k_1+k_2)-1)}$$ $$F_2 = \frac{(SSR_1-SSR_2)/k_2}{SSR_3/(n-(k_1+k_2+k_3)-1)}$$

Note that $F_1$ makes no mention of model3. When we just have to decide between two nested models, this is the test statistic to use. However, if we have three (or more) nested models, we can choose which statistic to use.

Suppose we want to test whether X2 should be added to the model containing X1 and the intercept. This is $H_0: \beta_2 = 0$ if we compare model1 with model2. For this test, we call anova(model1, model2) in R, which uses the test statistic $F_1$.

If instead we type anova(model3), R constructs $k_1+k_2+k_3+1$ nested models (adding a single regressor at each step) and then tests, step by step, whether each added regressor improves on the previous model. Here both test statistics $F_1$ and $F_2$ could be used (both have an F-distribution under $H_0: \beta_2=\beta_3=0$), and R uses $F_2$, i.e. it always puts the residual sum of squares of the largest model in the denominator.
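In general (my notation, not from the R docs), the $i$-th row of an anova(model) table uses $$ F_{\text{row}\,i} = \frac{(SSR_{i-1}-SSR_i)/df_i}{SSR_{\text{full}}/df_{\text{resid}}}, $$ where $SSR_{i-1}$ and $SSR_i$ are the residual sums of squares before and after adding the $i$-th term, $df_i$ is the number of coefficients that term adds, and the denominator is the residual mean square of the largest model, which is exactly what the quoted help passage describes.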

Let's check this with a toy example:

# toy data
x1 = c(1.5,2,3,4,5)
x2 = c(  2,4,5,4,6)
x3 = c(  2,4,3,5,7)
y  = c(1.5,3,3,4,5)

# three nested models
model1 = lm(y~x1)
model2 = lm(y~x1+x2)
model3 = lm(y~x1+x2+x3)

# number of predictors added at each step (here: one each)
k1 = length(coef(model1)) - 1
k2 = length(coef(model2)) - k1 - 1
k3 = length(coef(model3)) - k2 - k1 - 1
n  = length(y)

# sum of squared residuals of each model
SSR1 = sum(residuals(model1)^2)
SSR2 = sum(residuals(model2)^2)
SSR3 = sum(residuals(model3)^2)

# F1 uses the residual SSR of model2, F2 that of model3
F1 = ((SSR1-SSR2)/k2) / (SSR2/(n-k1-k2-1))
F2 = ((SSR1-SSR2)/k2) / (SSR3/(n-k1-k2-k3-1))

anova(model1,model2)  # reports F1
print(F1)
anova(model3)         # the x2 row reports F2
print(F2)

The output is:

> anova(model1,model2)
Analysis of Variance Table

Model 1: y ~ x1
Model 2: y ~ x1 + x2
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1      3 0.65244                          
2      2 0.47658  1   0.17586 0.738 0.4808

> print(F1)
[1] 0.7380133

> anova(model3)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value  Pr(>F)  
x1         1 6.1476  6.1476 58.1457 0.08301 .
x2         1 0.1759  0.1759  1.6633 0.41988  
x3         1 0.3709  0.3709  3.5076 0.31222  
Residuals  1 0.1057  0.1057                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> print(F2)
[1] 1.663348

Without knowing what is going on in the background, one would expect the F-value belonging to x2 to be the same in both anova calls. But it is not: as you can see, R computes the test statistic $F_1$ in the first case and $F_2$ in the second.
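As a small addendum of mine (assuming the objects from the toy example above are still in scope): the p-values in both outputs can be reproduced with pf(), using the residual degrees of freedom of model2 for $F_1$ and of model3 for $F_2$.

pf(F1, df1 = k2, df2 = n - k1 - k2 - 1,      lower.tail = FALSE)  # 0.4808
pf(F2, df1 = k2, df2 = n - k1 - k2 - k3 - 1, lower.tail = FALSE)  # 0.41988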

H.K.
  • Thank you! That was a really good explanation! Could you please explain what you mean by

    "First, let's remind ourselves that for any null hypothesis test situation, there are usually different test statistics one could choose from. If $H_0$ is true, the test statistics will each be significant in exactly $\alpha \cdot 100%$ of the samples (given any significance level $\alpha$), but with a given sample, they might give different results."

    Why will all test statistics be significant?!

    – Mr.M May 22 '16 at 18:13
  • I meant that two test statistics don't have to give the same value for a given sample. If $H_0$ is true, all test statistics give results that lead to (wrong) rejection of $H_0$ in $\alpha\cdot 100\%$ of the samples (by definition of the significance level $\alpha$), but not necessarily in the same samples, since the values of the test statistics need not be equal. In the anova() case, the test statistics $F_1$ and $F_2$ both have an F-distribution under $H_0$ (though with different degrees of freedom), but they usually result in different values (and in different p-values) for a given sample. (A small simulation illustrating this point follows these comments.) – H.K. May 22 '16 at 20:35
  • It is a standard result that $F_1$ is F-distributed under the null. Do you have an explanation or citation for the claim that $F_2$ is also F-distributed under the null? – det Jan 19 '23 at 11:03
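(A hedged simulation sketch of my own, illustrating H.K.'s comment about rejection rates: under $H_0: \beta_2 = \beta_3 = 0$, the x2-decisions based on $F_1$ and $F_2$ each reject about 5% of the time, but not on the same samples.)

set.seed(2)
reps <- 2000
rej1 <- rej2 <- logical(reps)
for (i in 1:reps) {
  x1 <- rnorm(30); x2 <- rnorm(30); x3 <- rnorm(30)
  y  <- 1 + x1 + rnorm(30)                              # beta2 = beta3 = 0
  p1 <- anova(lm(y ~ x1), lm(y ~ x1 + x2))$"Pr(>F)"[2]  # F1-based p-value
  p2 <- anova(lm(y ~ x1 + x2 + x3))$"Pr(>F)"[2]         # F2-based (x2 row)
  rej1[i] <- p1 < 0.05
  rej2[i] <- p2 < 0.05
}
mean(rej1); mean(rej2)  # each close to 0.05
mean(rej1 != rej2)      # > 0: the decisions differ on some samples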