Huffman coding already does a good job of compressing ASCII bytes for data with a distribution like the following:
byte freq freq%
---------------------
0 317116 26.1
, 151471 12.5
1 112952 9.3
F 60810 5
@ 60810 5
2 53642 4.4
8 49595 4.1
6 46548 3.8
. 45339 3.7
5 44343 3.7
3 40005 3.3
4 38873 3.2
7 38716 3.2
9 34194 2.8
\ 30405 2.5
Q 30128 2.5
; 29129 2.4
S 16118 1.3
B 14287 1.2
C 277 0
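For context, here is a minimal Python sketch of the kind of Huffman construction I mean. The `counts` dict only holds the first few rows of the table above, and the function name is just illustrative:

```python
# Minimal sketch: build Huffman codes from byte counts (first few rows only).
import heapq
from itertools import count

counts = {b'0': 317116, b',': 151471, b'1': 112952, b'F': 60810, b'@': 60810}

def huffman_codes(counts):
    """Return a dict mapping each symbol to its Huffman bit string."""
    tiebreak = count()  # distinct integers keep the heap from comparing lists
    heap = [(freq, next(tiebreak), [sym]) for sym, freq in counts.items()]
    heapq.heapify(heap)
    codes = {sym: '' for sym in counts}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Symbols merged earlier end up deeper in the tree -> longer codes.
        for s in syms1:
            codes[s] = '0' + codes[s]
        for s in syms2:
            codes[s] = '1' + codes[s]
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return codes

print(huffman_codes(counts))
```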
Now when I look at the 2-grams, I get the following table:
bytes freq freq%
----------------------
00 148269 12.2
,1 51993 4.3
0, 46416 3.8
1, 42348 3.5
0/ 30405 2.5
F@ 30405 2.5
FF 30405 2.5
@Q 30128 2.5
;F 29129 2.4
,0 28202 2.3
0; 27470 2.3
10 24346 2
/0 19337 1.6
....
Here's the top of the 3-gram list:
bytes freq freq%
--------------------
000 70254 5.8
00, 41218 3.4
,1, 31651 2.6
FF@ 30405 2.5
00\ 30254 2.5
F@Q 30128 2.5
;FF 29129 2.4
...
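Counts like the 2-gram and 3-gram tables above can be gathered with a simple overlapping sliding-window counter along these lines (a sketch; `data` is assumed to be the whole input as a single bytes object, and `ngram_table` is a made-up name):

```python
from collections import Counter

def ngram_table(data: bytes, n: int, top: int = 20):
    """Count overlapping n-grams and return the `top` most common with their share."""
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(grams.values())
    return [(g, c, 100.0 * c / total) for g, c in grams.most_common(top)]

# Example usage:
# for gram, freq, pct in ngram_table(data, 2):
#     print(gram, freq, f"{pct:.1f}")
```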
So it seems it would make sense to give at least the more frequent 2-grams and 3-grams their own Huffman codes, and perhaps encode the less frequent 2-grams/3-grams as sequences of individual 1-grams.
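To make the idea concrete, here is a rough sketch of the kind of mixed tokenization I have in mind: greedily match a hand-picked set of frequent 2-/3-grams (longest first), fall back to single bytes, and then Huffman-code the resulting token stream. The EXTRA set below is just an arbitrary illustrative pick from the tables above, and the greedy parse is not necessarily optimal.

```python
EXTRA = {b'000', b'00,', b',1,', b'FF@', b'00', b',1', b'0,'}  # hypothetical choice
MAX_LEN = max(len(g) for g in EXTRA)

def tokenize(data: bytes):
    """Greedily split `data` into chosen n-grams, falling back to single bytes."""
    i, tokens = 0, []
    while i < len(data):
        for n in range(MAX_LEN, 1, -1):      # try 3-grams first, then 2-grams
            if data[i:i + n] in EXTRA:
                tokens.append(data[i:i + n])
                i += n
                break
        else:                                # no multi-byte match: emit one byte
            tokens.append(data[i:i + 1])
            i += 1
    return tokens

# collections.Counter(tokenize(data)) then gives the token frequencies
# to feed into an ordinary Huffman coder (e.g. huffman_codes() above).
```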
Is there any existing (preferably practically applicable) research on this, i.e. on how to determine the optimal mix of 1-grams, 2-grams, 3-grams, and so on?