
Say I have some black box which, given any English word, deterministically outputs a token for that word. Assume the black box is implemented using strong cryptography, i.e., the hardness of recovering a word from its token is reducible to some standard assumption.

Now, assume I have a document corpus where a document is some list of English words. I run every word in every document through my black box to produce a new set of tokenized documents. I then give those tokenized documents to an attacker, who carries out a ciphertext-only attack to try to guess what the documents say.
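
For concreteness, here is a minimal sketch of what I have in mind, using HMAC-SHA256 under a secret key as a stand-in for the black box (the key and the toy corpus are just illustrative assumptions; any deterministic keyed PRF would do):

```python
import hmac
import hashlib

# Hypothetical secret key held by the tokenizer; the attacker never sees it.
SECRET_KEY = b"some-secret-key"

def tokenize_word(word: str) -> str:
    """Deterministically map one English word to an opaque token (keyed PRF)."""
    return hmac.new(SECRET_KEY, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def tokenize_document(doc):
    """Replace every word in a document with its token."""
    return [tokenize_word(w) for w in doc]

# Toy corpus: each document is a list of English words.
corpus = [
    ["the", "cat", "sat", "in", "the", "hat"],
    ["the", "dog", "slept", "in", "the", "sun"],
]
tokenized_corpus = [tokenize_document(d) for d in corpus]
```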

I'm curious how successful this attacker will be in recovering partial information about the documents. He'll try to use statistical attacks to fit the frequency curve of the tokens to the frequency curve of English words. This will let him guess the preimages of the more frequent words with high confidence, but will he be able to guess less frequently used words? Are there more advanced attacks he could use?
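
The kind of frequency matching I'm imagining would look roughly like this (the ranked English list here is just a placeholder for a real word-frequency table):

```python
from collections import Counter

# Placeholder for a real ranked English word-frequency table.
english_by_rank = ["the", "of", "and", "a", "to", "in", "is", "you", "that", "it"]

def frequency_guess(tokenized_corpus):
    """Pair token ranks with English word ranks; only plausible for the most common words."""
    counts = Counter(tok for doc in tokenized_corpus for tok in doc)
    return {
        tok: english_by_rank[rank]
        for rank, (tok, _) in enumerate(counts.most_common(len(english_by_rank)))
    }
```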

pg1989

3 Answers


How feasible this is depends a lot on the length of the corpus: the more material there is to gather statistics from, the better the attacker's guesses will be.

He'll try to use statistical attacks to fit the frequency curve of the tokens to the frequency curve of English words. This will let him guess the preimages of the more frequent words with high confidence, but will he be able to guess less frequently used words? Are there more advanced attacks he could use?

Once you knew or had guessed the most common words, like prepositions, you could try n-grams. For example, if a particular token frequently appeared immediately after the tokens for "in the", you could make some reasonable guesses. Find other instances where that token appears and you may be able to cross-reference.
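
A rough sketch of that cross-referencing step, assuming the attacker has already mapped two tokens back to "in" and "the":

```python
from collections import Counter

def tokens_following(tokenized_corpus, guessed_pair):
    """Count which tokens appear immediately after a pair of already-guessed tokens.

    guessed_pair: the two tokens the attacker believes stand for, e.g., ("in", "the").
    """
    followers = Counter()
    for doc in tokenized_corpus:
        for i in range(len(doc) - 2):
            if (doc[i], doc[i + 1]) == guessed_pair:
                followers[doc[i + 2]] += 1
    return followers.most_common()
```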

In reality, you'd do it programmatically, optimizing a quality function where each word pair, triplet, etc. adds to or subtracts from the score depending on whether it is a common combination in English.
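
For instance, a simple hill-climbing loop over candidate assignments, scored against a bigram model; english_bigram_logprob here is a placeholder for whatever English language model the attacker has built, and the whole thing is only a sketch of the idea:

```python
import random

def score(mapping, tokenized_corpus, english_bigram_logprob, floor=-12.0):
    """Sum bigram log-probabilities of the decoded text; unknown pairs get a penalty."""
    total = 0.0
    for doc in tokenized_corpus:
        words = [mapping.get(tok, "?") for tok in doc]
        for a, b in zip(words, words[1:]):
            total += english_bigram_logprob.get((a, b), floor)
    return total

def hill_climb(mapping, tokenized_corpus, english_bigram_logprob, steps=10000):
    """Randomly swap two guessed preimages and keep the change if the score improves."""
    best = score(mapping, tokenized_corpus, english_bigram_logprob)
    tokens = list(mapping)
    for _ in range(steps):
        a, b = random.sample(tokens, 2)
        mapping[a], mapping[b] = mapping[b], mapping[a]
        new = score(mapping, tokenized_corpus, english_bigram_logprob)
        if new >= best:
            best = new
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]  # revert the swap
    return mapping, best
```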

Knowing exact word lengths would reduce the search space significantly, but with an AES-sized block you'd get almost no extra information, since very few words are longer than the 16 bytes (or 15 with padding) that fit in one block, and those will probably not appear often, at least if you use ASCII/UTF-8 for English. UTF-16 would help the attacker some, and, e.g., UTF-8 German might as well (multi-byte characters and longer words).
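
To see how little a 16-byte block leaks, you could bucket a word list by the number of AES blocks its UTF-8 encoding needs under PKCS#7-style padding; essentially every English word lands in a single block (a sketch, with a made-up word list):

```python
def blocks_needed(words):
    """Bucket words by how many 16-byte AES blocks their UTF-8 encoding needs (PKCS#7 padding)."""
    buckets = {}
    for w in words:
        n = len(w.encode("utf-8"))
        blocks = n // 16 + 1  # PKCS#7 always adds at least one padding byte
        buckets.setdefault(blocks, []).append(w)
    return buckets

# e.g. blocks_needed(["in", "the", "antidisestablishmentarianism"]) -> {1: [...], 2: [...]}
```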

Trying to apply grammar to the text would be the next step, but I'm not sure it would help. N-grams already tell you what kinds of words can follow, e.g., an adjective.

otus

This sounds like a classic codebook or nomenclator.

Even if we assume a perfect random oracle that generates a completely random codeword for each word of English text, I agree with otus that frequency attacks and N-grams would likely be able to decode the most frequently used words. Also, a known-plaintext attack (or worse, a chosen-plaintext attack) would leak the codeword for every word in the known plaintext.
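
That leak is trivial to exploit: a single known plaintext/ciphertext document pair immediately yields a partial codebook. A sketch, assuming the attacker can align words with tokens one-to-one:

```python
def partial_codebook(plaintext_words, ciphertext_tokens):
    """Recover token -> word mappings from one known plaintext/ciphertext document pair."""
    if len(plaintext_words) != len(ciphertext_tokens):
        raise ValueError("plaintext and ciphertext must align word-for-word")
    return dict(zip(ciphertext_tokens, plaintext_words))
```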

However, I agree that hapax legomena and other rare words may be practically impossible to decrypt using a ciphertext-only attack.

I occasionally hear someone mention that 16th-century ciphertexts encrypted with a nomenclator have never been broken.

David Cary

Provided the text was long enough and used a simple codebook substitution cipher, absolutely. English has common bigrams and trigrams, as well as words that typically appear in certain positions in sentences, like "The".

Also, if punctuation were tokenized in the codebook, it would be incredibly easy to identify "." and ",", because those will be among the most common tokens and the last token of each document will almost always be ".".
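
For example, a quick pass over the tokenized corpus would surface the period token almost immediately (a sketch, assuming each document is a list of tokens):

```python
from collections import Counter

def guess_period_token(tokenized_corpus):
    """The token that most often ends a document is very likely the one for '.'."""
    final_tokens = Counter(doc[-1] for doc in tokenized_corpus if doc)
    return final_tokens.most_common(1)[0][0]
```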

Mike Edward Moras