
I am trying to get my head round the chi square test, when used with the Caesar cipher. I started off using this formula,

$$ X = \sum_{i = 1}^k \frac{f_i \cdot f'_i}{n \cdot n'} $$

where $k$ is the number of distinct letters in the alphabet, $f_i$ is the number of times the $i$-th letter appears in the first string, $f'_i$ is the number of times it appears in the second string, and $n$ and $n'$ are the total numbers of characters in the first and second strings.
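As a concrete illustration, here is a minimal Python sketch of the statistic exactly as defined above; the function name and the default alphabet are my own choices, not from the question:

```python
from collections import Counter

def correlation_statistic(s1: str, s2: str,
                          alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> float:
    """X = sum over the alphabet of (f_i * f'_i) / (n * n')."""
    c1, c2 = Counter(s1.lower()), Counter(s2.lower())
    # n and n' count only characters that belong to the alphabet.
    n1 = sum(c1[a] for a in alphabet)
    n2 = sum(c2[a] for a in alphabet)
    return sum(c1[a] * c2[a] for a in alphabet) / (n1 * n2)
```

Note that this statistic is largest when the two frequency profiles line up, which is why the question selects the *highest* value rather than the lowest.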

However, when I run this, the highest value is not always the correct result. I was expecting values around 0.0650, but the correct answer is coming out around 0.0700. Is this the right way of calculating the chi value?

The formula on Wikipedia is completely different, and there the values should be near 0 for the correct answer, which has confused me.

Paŭlo Ebermann
Lunar

1 Answer


I'll assume that the objective is to assess whether the distribution of the $f'_i/n'$ is sufficiently similar to the distribution of the $f_i/n$ to support that a substitution cipher (including the Caesar cipher) with the same permutation table and the same frequency of plaintext characters could have been used in both cases.

If $n \gg n'$, $f_i \gg 5$, and $f'_i \ge 5$ for each $i$, we can use the first sample as a reference; the expected $f'_i$ is $n'\cdot f_i/n$, and the usual chi-squared test is now to compute

$$X = \sum_{i = 1}^k \frac{(f'_i-n'\cdot f_i/n)^2}{n'\cdot f_i/n} = \sum_{i = 1}^k \frac{(n\cdot f'_i-n'\cdot f_i)^2}{n\cdot n'\cdot f_i}$$

which should be distributed as a chi-squared variable with $k-1$ degrees of freedom under the null hypothesis. We can reject the null hypothesis (substitution cipher with the same permutation table and frequency of plaintext characters, assumed independent) if $X$ is bigger than the value found in a table of chi-squared critical values. E.g. for $k = 27$ and $X > 38.9$, we reject the null hypothesis with confidence level 95%. CAUTION: we have disregarded the fact that letters in the plaintext are not independent, and this will tend to make $X$ bigger than predicted by the null hypothesis.
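The second form of the statistic above is easy to compute directly from the raw counts. A sketch in Python, with the 38.9 threshold taken from the text (for $k = 27$, i.e. 26 degrees of freedom, at the 95% level); the function and constant names are my own:

```python
def chi_squared_statistic(f, f_prime):
    """X = sum_i (n*f'_i - n'*f_i)^2 / (n * n' * f_i), the second form above."""
    n, n_prime = sum(f), sum(f_prime)
    return sum((n * fp - n_prime * fi) ** 2 / (n * n_prime * fi)
               for fi, fp in zip(f, f_prime))

# Tabulated 95% critical value for 26 degrees of freedom (k = 27 categories),
# as quoted in the text; reject the null hypothesis when X exceeds it.
CRITICAL_95_K27 = 38.9
```

When the observed $f'_i$ match the expected $n'\cdot f_i/n$ exactly, $X$ is 0; the worse the match, the larger $X$ grows, which is why here (unlike with the question's formula) *small* values indicate the correct shift.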

If some $f'_i$ (or $f_i$) are too small, an option is to aggregate the smaller ones into a single value (and reduce $k$ accordingly).
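That aggregation step can be sketched as follows; this is one possible pooling strategy, assuming we pool by the reference counts $f_i$ (the helper name and threshold default are mine):

```python
def aggregate_small_counts(f, f_prime, min_count=5):
    """Merge categories whose reference count f_i is below min_count into
    one pooled category, reducing k accordingly."""
    kept_f, kept_fp = [], []
    pool_f = pool_fp = 0
    for fi, fp in zip(f, f_prime):
        if fi < min_count:
            pool_f += fi
            pool_fp += fp
        else:
            kept_f.append(fi)
            kept_fp.append(fp)
    if pool_f:  # append the pooled category, if any counts were pooled
        kept_f.append(pool_f)
        kept_fp.append(pool_fp)
    return kept_f, kept_fp
```

Remember that after pooling, the degrees of freedom become (new $k$) $- 1$, so the critical value must be looked up for the reduced $k$.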

I fail to find a reference on what to do when $n$ and $n'$ are of the same order of magnitude.

fgrieu