I've been playing around with the Matasano crypto challenges (cryptopals.com). I had a couple of false starts on the challenge that has you write a program to estimate the key size of a repeating-key XOR encrypted file using the Hamming distance between blocks of ciphertext (an approach related to the Index of Coincidence). After some head banging and pen-and-paper work on graph paper, I arrived at a working solution.
I've been trying my new toy script on different ciphertexts and key lengths, and I noticed that for key sizes greater than 12 bytes it works well (in my limited testing) and returns the correct key size. For shorter key sizes, the most probable key length returned is a multiple of the correct key size. In fact, the first few most probable key sizes are multiples of the correct key size; the correct key size itself appears further down in the results, though still ranked as more probable (lower average Hamming distance) than smaller, incorrect values.
Here are some examples that demonstrate what I'm talking about:
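For reference, here is a minimal Python sketch of the kind of scoring I mean. It is not my actual script: the names `hamming_distance` and `score_key_sizes` are made up for illustration, and the normalization (average differing bits per byte between adjacent key-size blocks) is an assumption about one reasonable way to do it.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def score_key_sizes(ciphertext: bytes, max_key_size: int) -> list:
    """Return (key_size, avg_dist) pairs, lowest average distance first.

    avg_dist is the Hamming distance between adjacent blocks, averaged
    over all adjacent pairs and divided by the block size (bits per byte).
    """
    scores = []
    for size in range(1, max_key_size + 1):
        # Keep only full blocks of `size` bytes.
        blocks = [ciphertext[i:i + size]
                  for i in range(0, len(ciphertext) - size + 1, size)]
        if len(blocks) < 2:
            break  # not enough data to compare two blocks
        total = sum(hamming_distance(x, y)
                    for x, y in zip(blocks, blocks[1:]))
        scores.append((size, total / ((len(blocks) - 1) * size)))
    return sorted(scores, key=lambda s: s[1])
```

Calling `score_key_sizes(ciphertext_bytes, 40)` would give a ranked list comparable in shape to the tables below.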
key = "eatoin shrdlu"
key.length
13
Let's XOR encrypt our plaintext with the key and get cipherText:
cipherText = XOR-Encrypt -String data -key key
cipherText (line breaks added) 21041a01001d003501010c090745151503021d000401060c4c31040f54240803491d1b191d4c14 070e011b491a4816482421223a2841161a0e4200070017441a140914114f060800050100101914 0941190e0a06491d0d52011f160411111c454e571b1152011a1017181b010c4e57120606174c01 0a41190e020b00161e171615550714134f1d0645531f1d161f01450e1a0a49014653091e084c01 0c0c114f061c00191d01104c14450301010a06001c0e520c1505004115010d4e571b090644181d 004135190c0047161a014404141304541b064e441c48050d181d45170103070b52120a1b080501 1c4110061a0d4c1c1b0716095b
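In case the encryption step is unclear, this is a rough Python equivalent of repeating-key XOR. The function name `xor_encrypt` and the sample plaintext are placeholders for illustration; they are not the cmdlet or the data used in this post.

```python
from itertools import cycle


def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the key, repeating the key as needed."""
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


key = b"eatoin shrdlu"
plaintext = b"placeholder plaintext, not the text used in this post"
cipher_hex = xor_encrypt(plaintext, key).hex()  # base16, like the output above

# XOR is its own inverse, so applying the same key again decrypts.
roundtrip = xor_encrypt(bytes.fromhex(cipher_hex), key)
```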
Let's run this through the XOR brute forcer to see what it says about the key size:
XOR-Brutr -MaxKeySize 40 -String cipherText -Encoding base16 | Sort-Object AvgDist
KeySize AvgDist
------- -------
13 2.74660633484163
39 2.78461538461538
26 2.81730769230769
33 2.8989898989899
40 2.905
25 2.93
38 2.94210526315789
...
Beautiful! The result correctly identifies the key size as 13 bytes. But with smaller keys...
key = "eatoin shrdl"
key.length
12
cipherText = XOR-Encrypt -String data -key key
XOR-Brutr -MaxKeySize 40 -String cipherText -Encoding base16 | Sort-Object AvgDist
KeySize AvgDist
------- -------
24 2.58333333333333
36 2.66111111111111
38 2.85789473684211
40 2.915
19 2.95215311004785
29 2.97536945812808
22 3.00909090909091
16 3.02232142857143
23 3.03381642512077
18 3.03703703703704
37 3.03783783783784
14 3.05357142857143
30 3.05714285714286
32 3.0625
27 3.06481481481481
12 3.06578947368421
28 3.06632653061224
...
Our top result is not 12, though 12 is the greatest common divisor of the top two results (24 and 36).
If we set the MaxKeySize to a higher value and use a 26 byte key:
key = "eatoin shrdluuldrhs niotae"
ciphertext = XOR-Encrypt -String data -key key
XOR-Brutr -MaxKeySize 104 -String CipherText -Encoding base16 | sort avgdist | select -First 10
KeySize AvgDist
------- -------
52 2.59615384615385
78 2.62179487179487
104 2.64423076923077
81 2.66666666666667
26 2.66826923076923
84 2.67857142857143
...
Again, a multiple of the actual key size rises to the top.
Question: Why does this pattern emerge? Why does the Hamming Distance favor multiples of the actual key size? Should my script determine the gcd of the first few results and weight the gcd more heavily and return it as the probable key?
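To make the gcd idea concrete, here is a small Python sketch of what I have in mind. The helper name `gcd_of_top` and the `top_n` cutoff are hypothetical, and the inputs are the key sizes from the result tables above, in ranked order.

```python
from functools import reduce
from math import gcd


def gcd_of_top(ranked_sizes, top_n=2):
    """GCD of the first top_n key sizes in a best-first ranked list."""
    return reduce(gcd, ranked_sizes[:top_n])


twelve_byte_run = [24, 36, 38, 40, 19]      # true key size: 12
twenty_six_byte_run = [52, 78, 104, 81, 26]  # true key size: 26

print(gcd_of_top(twelve_byte_run))               # 12
print(gcd_of_top(twenty_six_byte_run, top_n=3))  # 26
print(gcd_of_top(twelve_byte_run, top_n=3))      # 2
```

The last line shows why I'm unsure about this: including the third result (38) from the 12-byte run drags the gcd down to 2, so the answer depends heavily on how many candidates are included.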