Why does the Chi-squared test statistic follow the Chi-squared distribution

Question

I know the Chi-squared test statistic is defined as:

$$\chi^2=\sum_{i=1}^n\frac{({O_i-E_i})^2}{E_i}$$

where $O_i$ is observed data, and $E_i$ is expected.

I also know that the $\chi^2$ distribution is essentially defined as the sum of squared Gaussian random variables.

Does that mean that in order to use a Chi-squared test, one of your assumptions must be that $\sqrt{\frac{({O_i-E_i})^2}{E_i}}$ follows a Gaussian distribution? If so, is there an explanation/proof as to why this is a reasonable assumption?

Note: I didn't find any of the answers here super helpful: Why the chi-squared statistic follows chi-squared distribution?

It follows APPROXIMATELY a chi-square distribution if the sample size is large. The central limit theorem is involved. — Michael Hardy, Jan 07 '20 at 06:28
Right, I know approximately, forgot to mention that. I'll take another look at the CLT, it's been some time. — sir_thursday, Jan 07 '20 at 06:29
. . . . but there's more to this than that. See my answer below. It does not attempt to go through all the details. — Michael Hardy, Jan 07 '20 at 06:51

score 7 · Accepted Answer · answered Jan 07 '20 at 06:40

You have $O_i \sim \operatorname{Binomial}(m, E_i/m),$ where $m$ is the sample size.

So $\dfrac{O_i - E_i}{\sqrt{E_i(1 - (E_i/m))}} \approx \dfrac{O_i - E_i}{\sqrt{E_i}}$ is approximately normal if $m$ is large.
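
To spell out where that denominator comes from, here is a short sketch using the shorthand $p_i = E_i/m$ (a symbol introduced only for this note). The binomial mean and variance are

$$\mathbb{E}[O_i] = m p_i = E_i, \qquad \operatorname{Var}(O_i) = m p_i (1 - p_i) = E_i\left(1 - \frac{E_i}{m}\right),$$

so subtracting the mean and dividing by the standard deviation gives

$$\frac{O_i - \mathbb{E}[O_i]}{\sqrt{\operatorname{Var}(O_i)}} = \frac{O_i - E_i}{\sqrt{E_i\left(1 - (E_i/m)\right)}},$$

which the central limit theorem makes approximately $\operatorname N(0,1)$ when the sample size $m$ is large.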

However, notice that the terms $\dfrac{O_i-E_i}{\sqrt{E_i}},\ i=1,\ldots,n,$ whose squares are summed in the statistic, are neither independent nor uncorrelated. They are negatively correlated because they are subject to the constraint $$ \sum_{i=1}^n O_i = m. $$ For example, if you throw a die $1000$ times, then the numbers of times the different outcomes occur must sum to $1000;$ in this case we have $n=6$ and $m=1000.$ The matrix of covariances of these six terms is a $6\times6$ matrix of rank $5.$ When diagonalized, five of the diagonal entries are equal to $1$ and the sixth is $0.$ That is why the chi-square distribution has $5$ degrees of freedom. It is the distribution of the sum of the squares of $5=n-1$ independent $\operatorname N(0,1)$ random variables.
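
As a rough numerical check of the die example, here is a minimal Python sketch. It is not part of the original answer, and the function names, the seed, and the number of repetitions are illustrative assumptions. It repeats the experiment of throwing a fair die $1000$ times and computes the statistic each time. If the statistic is approximately chi-squared with $5$ degrees of freedom, the sample mean should be near $5$, the sample variance near $10$, and roughly $5\%$ of the values should exceed $11.07$, the 95th percentile of that distribution.

```python
import random
from statistics import mean, variance


def chi_squared_statistic(observed, expected):
    """Pearson statistic: sum over categories of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_die_statistics(m=1000, n=6, repetitions=2000, seed=0):
    """Throw a fair n-sided die m times, `repetitions` times over,
    and return the Pearson statistic from each repetition.

    The defaults (including the seed) are arbitrary choices for illustration.
    """
    rng = random.Random(seed)
    expected = [m / n] * n  # E_i = m / n for a fair die
    stats = []
    for _ in range(repetitions):
        throws = rng.choices(range(n), k=m)
        observed = [0] * n
        for face in throws:
            observed[face] += 1  # the counts O_i always sum to m
        stats.append(chi_squared_statistic(observed, expected))
    return stats


stats = simulate_die_statistics()

# A chi-squared distribution with k degrees of freedom has mean k and variance 2k.
print("sample mean:    ", mean(stats))
print("sample variance:", variance(stats))

# 11.07 is (to two decimals) the 95th percentile of chi-squared with 5 df.
print("fraction > 11.07:", sum(s > 11.07 for s in stats) / len(stats))
```

The simulation only illustrates the approximation for one choice of $m$ and $n$; it is not a proof, and the agreement gets worse when some $E_i$ are small.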

Michael Hardy

A few questions... (1) why is $O_{i}$ distributed as a Binomial? Is this an assumption? (2) Is $m$ the sum of all the $O_{i}$? – sir_thursday Jan 07 '20 at 15:45
Also, for future reference, I found https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture11.pdf which actually walks through the proof step by step. Thanks! – sir_thursday Jan 07 '20 at 16:06
@sir_thursday : The binomial distribution is the distribution of the number of successes in a fixed (non-random) number of independent trials with the same probability of success on each trial. Throw a die a thousand times. How many "4"s do you get? That's the number of successes in 1000 independent trials with probability 1/6 of success on each trial. So it's binomially distributed. – Michael Hardy Jan 07 '20 at 19:18
@MichaelHardy Very interesting. Do you have a reference explaining this, and also the other statistical tests, in a similar way? I can't find any. – edamondo Nov 23 '23 at 14:51
@sir_thursday Thanks for including that reference - although the argument turns out to be "simple" the pdf for this gives a comprehensive discussion. – Chris Dec 13 '24 at 19:36
