How do you calculate the digraphic and trigraphic IOCs, and the corresponding expected IOCs? I'm aware of the formula for the single-letter IOC, which is given on Wikipedia, but I can't find the formula for the higher orders.

Is it simply the sum of the frequency terms ($F_i(F_i - 1)$ or $F_i \times F_i$, where $F_i$ is each letter frequency) divided by 676 (26×26) or by 17576 (26×26×26)?

fgrieu
firefly

1 Answer

Edit (TL;DR): Note that if you take the equation $$ IC_{digrams}=26^2 \sum_{(\alpha,\beta) \in \{a,\ldots,z\}^2} \left(\frac{n_{(\alpha,\beta)}}{\binom{N}{2}}\times \frac{n_{(\alpha,\beta)}-1}{\binom{N}{2}-1}\right) $$ that I provide below for digrams, and simplify it by referring to the digrams by their index $i,$ you get $$ IC_{digrams}=26^2 \sum_{i} \frac{F_i}{N'}\times\frac{F_i-1}{N'-1} $$ where $N'=\binom{N}{2}$ (for example, $N=26$ gives $N'-1=324$). So it's not quite "the sum of frequencies divided by $26^2$": the factor $26^2$ multiplies the sum, and the denominator is $N'(N'-1),$ where $N'$ is roughly half of $26^2$ in that example.

Let $X$ be the alphabet, so $|X|=26$ for English, and let $N$ be the length of the given text. We can perform a dimensional analysis of the formula on Wikipedia.

Think of the formula in the Wikipedia reference given in the question as $$ |X|\sum_{x \in X}\left(\frac{n_x}{N}\times\frac{n_x-1}{N-1}\right)= |X|\sum_{x \in X}\left(\frac{n_x}{N_{slots}}\times\frac{n_x-1}{N_{slots}-1}\right) $$ with $N_{slots}=N$ for single letters. The normalizing factor $|X|$ (alphabet size) appears because each term is a product of two ratios with the dimension of a relative frequency, i.e., $1/|X|$ each, giving dimension $1/|X|^2$ per term; summing over $|X|$ terms gives overall dimension $1/|X|,$ which the factor $|X|$ cancels.
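As a minimal sketch of the single-letter case, the normalized formula can be computed as follows. The function name `monographic_ic` and the choice to keep only the letters a to z (case-folded) are assumptions made for illustration, not part of the Wikipedia formula.

```python
from collections import Counter


def monographic_ic(text: str, alphabet_size: int = 26) -> float:
    """Normalized single-letter index of coincidence.

    Computes |X| * sum_x n_x (n_x - 1) / (N (N - 1)), where N is the
    number of letters kept. Filtering to a..z is an assumption.
    """
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    # Number of ordered pairs of positions holding the same letter.
    coincidences = sum(k * (k - 1) for k in counts.values())
    return alphabet_size * coincidences / (n * (n - 1))
```

With this normalization, a text in which every letter is distinct scores 0, and a text consisting of one repeated letter scores $|X|=26$.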

Digrams: You have $X=\{a,b,\ldots,z\}^2$ with $|X|=26^2,$ and $N_{slots}=\binom{N}{2}$ which gives $$ IC_{digrams}=26^2 \sum_{(\alpha,\beta) \in \{a,\ldots,z\}^2} \left(\frac{n_{(\alpha,\beta)}}{N_{slots}}\times\frac{n_{(\alpha,\beta)}-1}{N_{slots}-1}\right) =26^2 \sum_{(\alpha,\beta) \in \{a,\ldots,z\}^2} \left(\frac{n_{(\alpha,\beta)}}{\binom{N}{2}}\times \frac{n_{(\alpha,\beta)}-1}{\binom{N}{2}-1}\right) $$ where $(\alpha,\beta)$ is an ordered 2-tuple of letters over the alphabet. In this analysis slots of the form $\{1,2\}$ and $\{2,3\}$ are counted as distinct even when overlapping in some positions.

Trigrams: You have $X=\{a,b,\ldots,z\}^3$ with $|X|=26^3,$ and $N_{slots}=\binom{N}{3}$ which gives $$ IC_{trigrams}=26^3 \sum_{(\alpha,\beta,\gamma) \in \{a,\ldots,z\}^3} \left(\frac{n_{(\alpha,\beta,\gamma)}}{N_{slots}}\times \frac{n_{(\alpha,\beta,\gamma)}-1}{N_{slots}-1} \right) =26^3 \sum_{(\alpha,\beta,\gamma) \in \{a,\ldots,z\}^3} \left(\frac{n_{(\alpha,\beta,\gamma)}}{\binom{N}{3}}\times \frac{n_{(\alpha,\beta,\gamma)}-1}{\binom{N}{3}-1}\right) $$ where $(\alpha,\beta,\gamma)$ is an ordered 3-tuple of letters over the alphabet. In this analysis slots of the form $\{1,2,3\}$ and $\{2,3,5\}$ are counted as distinct even when overlapping in some positions.
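The digram and trigram formulas share one shape, so they can be sketched as a single function of the tuple length $k$. This is only an illustration under a stated assumption: for the counts to sum to $N_{slots}=\binom{N}{k}$, a slot is taken to be any set of $k$ positions in increasing order (not only adjacent ones), which is how I read the slot count above. The name `kgram_ic` and the a to z filtering are hypothetical choices.

```python
from collections import Counter
from itertools import combinations
from math import comb


def kgram_ic(text: str, k: int, alphabet_size: int = 26) -> float:
    """Normalized k-gram index of coincidence with N_slots = C(N, k).

    Assumption: a slot is any k positions i1 < ... < ik, so the tuple
    counts sum to exactly C(N, k). Cost grows like N**k, so this is
    only practical for short texts.
    """
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    n_slots = comb(len(letters), k)
    if n_slots < 2:
        raise ValueError("need at least two slots")
    # combinations() yields the letters at every increasing k-subset
    # of positions, in text order, i.e. one ordered k-tuple per slot.
    counts = Counter(combinations(letters, k))
    coincidences = sum(c * (c - 1) for c in counts.values())
    return alphabet_size**k * coincidences / (n_slots * (n_slots - 1))
```

Calling `kgram_ic(text, 2)` corresponds to $IC_{digrams}$ and `kgram_ic(text, 3)` to $IC_{trigrams}$ under that reading; with `k=1` the expression reduces to the single-letter formula.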

kodlu