This topic has been discussed in this forum, but I don't think the problem has been addressed completely.
To apply Pearson's test to a contingency table, one computes $$\sum_{\text{cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$ and argues that, under certain conditions, this statistic has an approximate $\chi^2$ distribution.
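For concreteness, here is a minimal sketch of that computation (Python; the $2\times 2$ counts are made up, and scipy's `chi2_contingency` is used only to confirm the by-hand value):

```python
# A minimal check of the formula (the 2x2 counts below are made up).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30., 20.],
                     [20., 30.]])

# Expected counts under independence: row total * column total / grand total
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)) / observed.sum()

stat = ((observed - expected) ** 2 / expected).sum()

# scipy computes the same statistic (Yates correction disabled for comparability)
chi2, p, dof, _ = chi2_contingency(observed, correction=False)
print(stat, chi2, p)  # 4.0  4.0  p ≈ 0.0455 with dof = 1
```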
The obvious question is: why don't we divide by $\text{Expected}^2$? The usual answer, that dividing by Expected makes the statistic (approximately) a sum of squared standard normal terms, is in my view not acceptable: we should do what must be done, and if what we get is not nice, well, so be it.

In fact, here is what can happen. Suppose we take the original sample and split it in two, keeping the same proportions, i.e. each cell count is the same proportion of the total count as it was before the split. Then every Observed and every Expected is halved, so the value of the statistic is half of what it was, and a null hypothesis that was rejected before the split may no longer be rejected afterwards (equivalently, doubling the sample while keeping the same proportions doubles the statistic and can turn a non-rejection into a rejection); a quick numerical check is appended at the end. How is this reasonable? Shouldn't the dependence/independence determination be the same in both cases?

I haven't found a discussion of this aspect anywhere in the literature. Pointers would be welcome. Thanks.
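For what it's worth, here is a quick numerical check of the scaling claim (Python, same made-up table as above; nothing here depends on the particular counts):

```python
# Multiplying every cell by c multiplies the Pearson statistic by c,
# so the p-value, and possibly the reject / don't-reject decision, changes.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30., 20.],
                     [20., 30.]])

for c in (0.5, 1, 2):
    chi2, p, dof, _ = chi2_contingency(observed * c, correction=False)
    print(c, chi2, round(p, 4))
# 0.5  2.0  0.1573   <- half the sample: not rejected at the 5% level
# 1    4.0  0.0455   <- original sample: rejected at the 5% level
# 2    8.0  0.0047   <- doubled sample: rejected even more decisively
```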