
I am trying to link my understanding of the chi-square test with my conception of the chi-squared distribution. More precisely, I understand the procedure of the chi-square test, e.g. as used with a cross-tabulation, and that it aims to determine whether the difference between the observed values and the expected values is statistically significant. What I am struggling with is that when I read up on the chi-squared distribution, it is described as the distribution of a sum of the squares of $k$ independent standard normal random variables. However, when I do cross-tabulations, where do these normal random variables show up / come into play? Any pointers much appreciated!

1 Answer


Suppose we have a table with $r$ rows and $c$ columns.

When you perform the chi-square test on the table (whose actual entries we refer to as outcomes), you first calculate the expected value of each entry in the table.

Once all the expected values have been computed, the chi-squared test statistic is obtained by summing, over every entry, the squared difference between the actual outcome and the expected value, divided by that expected value:

$$ \chi^2=\sum_{m=1}^{r}\sum_{n=1}^{c}\frac{(\color{blue}{O(m,n)-E(m,n)})^2}{E(m,n)}$$ where $O(m,n)$ is the outcome for entry $(m,n)$ of the table (i.e. element in row $m$ and column $n$) and $E(m,n)$ the corresponding expected value that was calculated.
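As a concrete illustration, here is a minimal Python sketch of this calculation; the $2\times 3$ table of counts is made up purely for illustration:

```python
import numpy as np

# Hypothetical 2x3 contingency table of observed counts (illustrative data)
observed = np.array([[20, 30, 25],
                     [30, 20, 25]])

# Expected count for cell (m, n): row_total * column_total / grand_total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # 4.0 for this table (every expected count is 25)
```

The same result can be obtained from `scipy.stats.chi2_contingency`, which also returns the p-value and degrees of freedom.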

The difference between the actual outcome and the expected value (known as the residual) for each entry in the table, $\color{blue}{O(m,n)-E(m,n)}$, is approximately normally distributed: each cell count is approximately binomial, and the normal approximation applies when the expected counts are not too small. Its variance is approximately $E(m,n)$, so dividing by $\sqrt{E(m,n)}$ yields an approximately standard normal variable. These standardised residuals are where the standard normal random variables in the definition of the chi-squared distribution come into play.

As the test statistic is the sum of squares of these (approximately) standard normal quantities, it has approximately a chi-squared distribution. The residuals are not all free to vary independently, however: the row and column totals are fixed, which constrains the table, so only $(r-1)\times(c-1)$ of the standardised residuals are effectively independent. That is why the test uses $(r-1)\times(c-1)$ degrees of freedom rather than $r\times c$.
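To tie this back to the definition in the question, a chi-squared variable with $k$ degrees of freedom is by definition a sum of squares of $k$ independent standard normal variables; a quick simulation (with $k=2$, the degrees of freedom of a $2\times 3$ table) confirms the known mean $k$ and variance $2k$:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2  # degrees of freedom, e.g. (r-1)*(c-1) = 2 for a 2x3 table

# Draw many sums of squares of k independent standard normal variables
samples = rng.standard_normal((100_000, k))
chi2_draws = (samples ** 2).sum(axis=1)

# A chi-squared(k) variable has mean k and variance 2k
print(chi2_draws.mean())  # close to 2
print(chi2_draws.var())   # close to 4
```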

If the result of the chi-squared test is significant, i.e. the calculated p-value is below the chosen significance level of, say, $0.05$, we can examine which entries have significant differences between outcome and expected value. This is where the normal distribution comes into play directly: we calculate the standardised residual $(O(m,n)-E(m,n))/\sqrt{E(m,n)}$, treat it as a z-score, and compare it with the critical value of $\pm 1.96$ for the $0.05$ significance level.
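This per-cell check can be sketched as follows, again using a made-up $2\times 3$ table of counts:

```python
import numpy as np

# Hypothetical 2x3 contingency table of observed counts (illustrative data)
observed = np.array([[20, 30, 25],
                     [30, 20, 25]])
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Standardised residual per cell: (O - E) / sqrt(E), treated as a z-score
std_resid = (observed - expected) / np.sqrt(expected)
print(np.round(std_resid, 2))

# Flag cells whose residual exceeds the two-sided 0.05 critical value
flagged = np.abs(std_resid) > 1.96
print(flagged)  # all False here: no individual cell stands out
```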

Alijah Ahmed
  • Thanks for the answer, but why would you say O(m,n) - E(m,n) is a normal distribution. Doesn't that assume the standard deviation of the differences is equal to one? And also, how does the denominator E(m,n) affect the distribution's shape? I am also puzzled by the relation between the test statistic and the distribution! – Amr Keleg Mar 12 '23 at 13:32