The question is not stupid at all. The R help files are not very useful on this topic IMHO. The answer is somewhat hidden in the following quote from the R help file for anova.lm():
Normally the F statistic is most appropriate, which compares the mean square for a row to the residual sum of squares for the largest model considered.
If something is "normally" most appropriate, this means that there are alternatives, and in fact it turns out that there are two different F-statistics involved here.
Let me try to explain what is going on in terms of formulas. First, let's remind ourselves that for any null hypothesis test situation there are usually several test statistics one could choose from. If $H_0$ is true, each of the corresponding tests will come out significant (i.e. reject $H_0$) in exactly $\alpha\cdot 100\%$ of the samples (for any given significance level $\alpha$), but for a given sample they might give different results.
Now back to the regression context. Suppose we have three nested models
$$ model1: \,\, y = \beta_0 + \beta_1 X_1 +u$$
$$ model2: \,\, y = \beta_0 + \beta_1 X_1+\beta_2 X_2+u$$
$$ model3: \,\, y = \beta_0 + \beta_1 X_1+ \beta_2X_2 +\beta_3X_3+u$$
Let's assume $X_1$ consists of $k_1$ predictors, $X_2$ consists of $k_2$ predictors and $X_3$ consists of $k_3$ predictors. $u$ is normally distributed (and independent across observations) with (unknown) standard deviation $\sigma$.
Suppose that $\beta_2 = \beta_3=0$, i.e. all three models are correct.
Let's fit each model with a given dataset. I denote by SSR1, SSR2 and SSR3 the sums of squared residuals from model1, model2 and model3, respectively. Using (quite involved) algebra on multivariate normal distributions, one can show that (given $\beta_2 = \beta_3=0$) the following two random variables both have an F-distribution (with different degrees of freedom):
$$F_1 = \frac{(SSR1-SSR2)/k_2}{SSR2/(n-(k_1+k_2)-1)}$$
$$F_2 = \frac{(SSR1-SSR2)/k_2}{SSR3/(n-(k_1+k_2+k_3)-1)}$$
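Spelling out the degrees of freedom (numerator and denominator are independent scaled $\chi^2$ variables under the respective null hypotheses):
$$F_1 \sim F\big(k_2,\; n-(k_1+k_2)-1\big), \qquad F_2 \sim F\big(k_2,\; n-(k_1+k_2+k_3)-1\big)$$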
Note that in $F_1$, there's no mention of model3. When we just have to decide between two nested models, this is the test statistic to use. However, if we have three (or more) nested models, we can choose which statistic we use.
Let's suppose we want to test whether X2 should be added to the model that contains X1 and the intercept. This is $H_0:\beta_2=0$ when we compare model1 with model2. For this test, we call anova(model1, model2) in R. In this case R uses the test statistic $F_1$.
If we type anova(model3), R fits a sequence of $k_1+k_2+k_3+1$ models (starting from the intercept-only model and adding a single regressor in each step) and then tests, step by step, whether each added regressor has a zero coefficient given the regressors already included. For these tests both statistics $F_1$ and $F_2$ could be used (both have an F-distribution under $H_0: \beta_2=\beta_3=0$), and R uses $F_2$ here, i.e. it always puts the residual sum of squares of the largest model (model3) in the denominator, exactly as the help-file quote above says.
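Before the toy example, here is a small simulation sketch (not part of the original example; the sample size, seed, coefficients and number of replications are arbitrary choices) illustrating both points above: under $H_0: \beta_2=\beta_3=0$ with one predictor per group, $F_1$ and $F_2$ each reject in roughly $\alpha\cdot 100\%$ of samples, yet on individual samples the two tests can disagree.
set.seed(1)
n     = 30      # arbitrary sample size for the simulation
alpha = 0.05
nrep  = 5000
res = replicate(nrep, {
  x1 = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
  y  = 1 + 2*x1 + rnorm(n)                  # beta2 = beta3 = 0, so H0 is true
  SSR1 = sum(residuals(lm(y ~ x1))^2)
  SSR2 = sum(residuals(lm(y ~ x1 + x2))^2)
  SSR3 = sum(residuals(lm(y ~ x1 + x2 + x3))^2)
  F1 = (SSR1 - SSR2) / (SSR2/(n - 3))       # k2 = 1, denominator df = n - 3
  F2 = (SSR1 - SSR2) / (SSR3/(n - 4))       # denominator df = n - 4
  c(F1 > qf(1 - alpha, 1, n - 3),           # does the F1-test reject?
    F2 > qf(1 - alpha, 1, n - 4))           # does the F2-test reject?
})
rowMeans(res)              # both rejection rates should come out close to alpha
mean(res[1,] != res[2,])   # share of samples on which the two tests disagree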
Let's check this with a toy example:
# toy data
x1 = c(1.5,2,3,4,5)
x2 = c( 2,4,5,4,6)
x3 = c( 2,4,3,5,7)
y  = c(1.5,3,3,4,5)
# the three nested models
model1 = lm(y~x1)
model2 = lm(y~x1+x2)
model3 = lm(y~x1+x2+x3)
# number of predictors added in each step (here k1 = k2 = k3 = 1)
k1 = length(coef(model1)) - 1
k2 = length(coef(model2)) - k1 - 1
k3 = length(coef(model3)) - k2 - k1 - 1
n  = length(y)
# sums of squared residuals
SSR1 = sum(residuals(model1)^2)
SSR2 = sum(residuals(model2)^2)
SSR3 = sum(residuals(model3)^2)
# the two F-statistics defined above
F1 = ((SSR1-SSR2)/k2) / (SSR2/(n-k1-k2-1))
F2 = ((SSR1-SSR2)/k2) / (SSR3/(n-k1-k2-k3-1))
anova(model1,model2)
print(F1)
anova(model3)
print(F2)
The output is:
> anova(model1,model2)
Analysis of Variance Table
Model 1: y ~ x1
Model 2: y ~ x1 + x2
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1      3 0.65244
2      2 0.47658  1   0.17586 0.738 0.4808
> print(F1)
[1] 0.7380133
> anova(model3)
Analysis of Variance Table
Response: y
          Df Sum Sq Mean Sq F value  Pr(>F)
x1         1 6.1476  6.1476 58.1457 0.08301 .
x2         1 0.1759  0.1759  1.6633 0.41988
x3         1 0.3709  0.3709  3.5076 0.31222
Residuals  1 0.1057  0.1057
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> print(F2)
[1] 1.663348
Without knowing what is going on in the background, one would expect the F-value belonging to x2 to be the same in both anova calls. But it is not. As you can see, R computes the F-statistic $F_1$ in the first case and $F_2$ in the second.
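As a further check (not part of the output above; it just reuses the objects defined in the toy example and base R's pf()), the two x2 F-values can be pulled straight from the anova tables, and the printed p-values can be reproduced from $F_1$ and $F_2$ with the degrees of freedom derived above:
anova(model1, model2)$F[2]                             # should match F1 (0.738...)
anova(model3)["x2", "F value"]                         # should match F2 (1.663...)
pf(F1, k2, n - k1 - k2 - 1,      lower.tail = FALSE)   # should match 0.4808
pf(F2, k2, n - k1 - k2 - k3 - 1, lower.tail = FALSE)   # should match 0.41988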
"First, let's remind ourselves that for any null hypothesis test situation, there are usually different test statistics one could choose from. If $H_0$ is true, the test statistics will each be significant in exactly $\alpha \cdot 100%$ of the samples (given any significance level $\alpha$), but with a given sample, they might give different results."
Why will all test statistics be significant?!
– Mr.M May 22 '16 at 18:13