9

The formula for the Chi-Square test statistic is the following:

$$\chi^2=\sum_{i=1}^n\frac{({O_i-E_i})^2}{E_i}$$

where $O_i$ is observed data, and $E_i$ is expected.
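As a concrete illustration, here is a minimal Python sketch that computes this statistic for a made-up example (a die rolled 60 times, so the counts and expected frequencies below are hypothetical, not from the original post):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical data: a die rolled 60 times, expecting 10 of each face.
observed = np.array([8, 12, 9, 11, 6, 14])
expected = np.full(6, 10.0)

# Chi-square statistic: sum of (O_i - E_i)^2 / E_i
stat = np.sum((observed - expected) ** 2 / expected)

# p-value from the chi-square distribution with n - 1 = 5 degrees of freedom
p_value = chi2.sf(stat, df=5)
print(stat, p_value)
```

The same result can be obtained with `scipy.stats.chisquare(observed, expected)`.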

I am just curious: why does this follow the $\chi^2$ distribution?

Lella
  • 2,085
Quanwang
  • 101
  • 4
  • actually the chi-square distribution with $k$ degrees of freedom is the distribution of the sum of the squares of $k$ independent standard normal variables – dato datuashvili May 31 '14 at 13:18
  • 2
    Short answer: it doesn't. Your question has an error, which, even after you fix it, doesn't automatically imply that the component you mean has a normal distribution. You should clarify the circumstances in which you think this result holds. – Glen_b Jun 01 '14 at 03:20

2 Answers

3

It is $\frac{O_i-E_i}{\sqrt{E_i}}$ that is assumed to follow a normal distribution, not its square root. We are simply assuming that the standardized errors are Gaussian; it is just an assumption. The square of a standard Gaussian variable follows a Gamma distribution (specifically, a chi-square with one degree of freedom), and the sum of these squared variables follows a chi-square distribution with $n$ degrees of freedom. If we were instead to look at absolute values, namely $\left|\frac{O_i-E_i}{\sqrt{E_i}}\right|$, each term would have a half-normal distribution, and the sum $\sum_{i=1}^n\left|\frac{O_i-E_i}{\sqrt{E_i}}\right|$ would end up converging to a Gaussian, though unlike the chi-square case there is no clearly standard distribution for small $n$.
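The claim that a sum of squared standard normal variables follows a chi-square distribution is easy to check by simulation; a minimal sketch (the choice of $n=5$ and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 100_000

# Sum of squares of n independent standard normals, simulated `reps` times
z = rng.standard_normal((reps, n))
sums = (z ** 2).sum(axis=1)

# A chi-square(n) variable has mean n and variance 2n,
# so the empirical moments should be close to 5 and 10.
print(sums.mean(), sums.var())
```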

Nero
  • 3,779
  • 1
    Thanks for your kind help. It is my mistake to say $\sqrt\frac{O_i-E_i}{E_i}$ follows a normal distribution, and the square root should be removed. I am curious what is the rationale behind this assumption? – Quanwang Jun 08 '14 at 07:15
1

It is correct to say that the "goodness-of-fit" statistic follows the chi-squared distribution asymptotically, not exactly. This means that, provided the sample size $N$ is large, the statistic lies in any interval with probability approximately equal to that of a $ \chi^2_{n-1} $ variable lying in the same interval. Here I am assuming that the $E_i$ are expected frequencies arising from a completely specified model, with no estimation of parameters involved; otherwise the degrees of freedom would change.

A neat and simple proof, as well as a less simple one, can be found at http://sites.stat.psu.edu/~dhunter/asymp/lectures/p175to184.pdf. The simpler one goes roughly as follows: first observe that under the model, $ (O_1,O_2,\dots,O_n) $ has a multinomial distribution with parameters $N$ and cell probabilities $ \left(\frac{E_1}N, \frac{E_2}N, \dots, \frac{E_n}N \right) $.

This means that when $N$ is large, $ (O_1,O_2,\dots,O_n) $ has approximately an $n$-variate normal distribution, but a singular one, since $ \sum_{i=1}^n O_i \equiv N $ is non-random. Another way to see the singularity is to note that the parameters of the distribution are the mean vector and the dispersion matrix, and the latter is singular.
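This singularity is easy to see in simulation; a small sketch with made-up expected frequencies:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
E = np.array([200.0, 300.0, 500.0])  # hypothetical expected frequencies, summing to N

# Draw multinomial counts with cell probabilities E_i / N
O = rng.multinomial(N, E / N, size=10_000)

# The counts always sum to N exactly: the joint distribution is singular
assert np.all(O.sum(axis=1) == N)

# Each O_i has mean E_i, so the sample means should be close to E
print(O.mean(axis=0))
```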

However, any $n-1$ of $ O_1, O_2, \dots, O_n $ have approximately a non-singular $(n-1)$-variate normal distribution. Choosing $ \tilde O := (O_1, O_2, \dots, O_{n-1}) $, the inverse $ \Sigma^{-1} $ of the dispersion matrix $ \Sigma $ of $\tilde O $ can be calculated explicitly. Specifically, $ \Sigma^{-1} $ turns out to have all off-diagonal entries equal to $ 1 / E_n $ and, for $ 1 \le i \le n-1$, the $i$th diagonal entry equal to $ 1/E_i + 1/E_n $.

Finally, the goodness-of-fit statistic is shown to exactly equal the standardized sum of squares $ \{ \tilde O - E (\tilde O)\}^T \Sigma^{-1} \{ \tilde O - E (\tilde O)\} $, which approximately follows a $ \chi^2_{n-1} $ distribution. This is because, for every $ k $, the map on ${\mathbb R}^k $ taking a vector $ \tilde x $ to the real number $ ( \tilde x - \tilde a )^T A ( \tilde x - \tilde a ) $, for fixed $ \tilde a \in {\mathbb R}^k$ and fixed $k\times k $ matrix $ A $, is a continuous function, and the standardized sum of squares from an exact nonsingular $k$-variate normal distribution is $\chi^2_k$ distributed. Here $ k = n-1 $.

To check that the test statistic equals $ \{ \tilde O - E (\tilde O)\}^T \Sigma^{-1} \{ \tilde O - E (\tilde O)\} $, you will need the facts that $ E(O_i) = E_i $ under the model, and, repeatedly, that $ \sum_{i=1}^n (O_i-E_i) = N - N = 0 $.
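This equality can also be verified numerically; a minimal sketch with made-up frequencies, building $\Sigma^{-1}$ from the entries stated above (off-diagonals $1/E_n$, diagonals $1/E_i + 1/E_n$):

```python
import numpy as np

# Hypothetical frequencies: O and E share the same total N
E = np.array([200.0, 300.0, 500.0])
O = np.array([190.0, 320.0, 490.0])
n = len(E)

# Usual goodness-of-fit statistic
stat = np.sum((O - E) ** 2 / E)

# Quadratic form over the first n-1 cells with the stated Sigma^{-1}
Sigma_inv = np.full((n - 1, n - 1), 1.0 / E[-1]) + np.diag(1.0 / E[:-1])
d = (O - E)[: n - 1]
quad = d @ Sigma_inv @ d

print(stat, quad)  # the two agree exactly
```

The agreement is exact (not merely asymptotic) because $\sum_i (O_i - E_i) = 0$, so the last deviation is determined by the first $n-1$.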

Arindam
  • 41