Huffman coding already does a good job of compressing ASCII bytes for data with a distribution like the following:
byte freq freq%
---------------------
0 317116 26.1
, 151471 12.5
1 112952 9.3
F 60810 5
@ 60810 5
2 53642 4.4
8 49595 4.1
6 46548 3.8
. 45339 3.7
5 44343 3.7
3 40005 3.3
4 38873 3.2
7 38716 3.2
9 34194 2.8
\ 30405 2.5
Q 30128 2.5
; 29129 2.4
S 16118 1.3
B 14287 1.2
C 277 0
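For context, here is a minimal Python sketch of the kind of Huffman construction I mean. The `counts` dict only holds the first few rows of the table above, and the function name is just illustrative:

```python
# Minimal sketch: build Huffman codes from byte counts (first few rows only).
import heapq
from itertools import count

counts = {b'0': 317116, b',': 151471, b'1': 112952, b'F': 60810, b'@': 60810}

def huffman_codes(counts):
    """Return a dict mapping each symbol to its Huffman bit string."""
    tiebreak = count()  # distinct integers keep the heap from comparing lists
    heap = [(freq, next(tiebreak), [sym]) for sym, freq in counts.items()]
    heapq.heapify(heap)
    codes = {sym: '' for sym in counts}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Symbols merged earlier end up deeper in the tree -> longer codes.
        for s in syms1:
            codes[s] = '0' + codes[s]
        for s in syms2:
            codes[s] = '1' + codes[s]
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return codes

print(huffman_codes(counts))
```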
Now when I look at the 2-grams, I get the following table:
bytes freq freq%
----------------------
00 148269 12.2
,1 51993 4.3
0, 46416 3.8
1, 42348 3.5
0/ 30405 2.5
F@ 30405 2.5
FF 30405 2.5
@Q 30128 2.5
;F 29129 2.4
,0 28202 2.3
0; 27470 2.3
10 24346 2
/0 19337 1.6
....
Here's the top of the 3-gram list:
bytes freq freq%
--------------------
000 70254 5.8
00, 41218 3.4
,1, 31651 2.6
FF@ 30405 2.5
00\ 30254 2.5
F@Q 30128 2.5
;FF 29129 2.4
...
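Counts like the 2-gram and 3-gram tables above can be gathered with a simple overlapping sliding-window counter along these lines (a sketch; `data` is assumed to be the whole input as a single bytes object, and `ngram_table` is a made-up name):

```python
from collections import Counter

def ngram_table(data: bytes, n: int, top: int = 20):
    """Count overlapping n-grams and return the `top` most common with their share."""
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(grams.values())
    return [(g, c, 100.0 * c / total) for g, c in grams.most_common(top)]

# Example usage:
# for gram, freq, pct in ngram_table(data, 2):
#     print(gram, freq, f"{pct:.1f}")
```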
So it seems it would make sense to give at least the more frequent 2-grams and 3-grams their own Huffman codes, and perhaps encode the less frequent 2-grams/3-grams as sequences of individual 1-grams.
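To make the idea concrete, here is a rough sketch of the kind of mixed tokenization I have in mind: greedily match a hand-picked set of frequent 2-/3-grams (longest first), fall back to single bytes, and then Huffman-code the resulting token stream. The EXTRA set below is just an arbitrary illustrative pick from the tables above, and the greedy parse is not necessarily optimal.

```python
EXTRA = {b'000', b'00,', b',1,', b'FF@', b'00', b',1', b'0,'}  # hypothetical choice
MAX_LEN = max(len(g) for g in EXTRA)

def tokenize(data: bytes):
    """Greedily split `data` into chosen n-grams, falling back to single bytes."""
    i, tokens = 0, []
    while i < len(data):
        for n in range(MAX_LEN, 1, -1):      # try 3-grams first, then 2-grams
            if data[i:i + n] in EXTRA:
                tokens.append(data[i:i + n])
                i += n
                break
        else:                                # no multi-byte match: emit one byte
            tokens.append(data[i:i + 1])
            i += 1
    return tokens

# collections.Counter(tokenize(data)) then gives the token frequencies
# to feed into an ordinary Huffman coder (e.g. huffman_codes() above).
```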
Is there any existing (preferably practically applicable) research on this, i.e. on how to determine the optimal mix of 1-grams, 2-grams, 3-grams, and so on?