
Huffman encoding performs best when the distribution of the symbols of the alphabet used by the string to be encoded is dyadic, i.e., when every symbol's probability is a negative power of two.
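
For example, a source with symbol probabilities $\frac{1}{2}, \frac{1}{4}, \frac{1}{4}$ is dyadic: Huffman can assign the codewords $0, 10, 11$, whose expected length $\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{4}\cdot 2 = 1.5$ bits per symbol equals the entropy exactly, so no coding overhead is incurred.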

Given an arbitrary bit string S, how can we find the best alphabet for encoding it? Suppose S is an ASCII file. Then, given the regularity of 1-byte characters that such files exhibit, we would expect an optimal, or at least pretty good, alphabet to consist of, say, 8-bit or 16-bit words (for which we then build codes after constructing the Huffman tree).

Is there an algorithm for finding the optimal word width (assuming we use constant-length words)?

I would guess that, to evaluate an alphabet fairly, we should also consider the cost of storing the actual encoding. This addresses the case where the alphabet is just one symbol: the entire original string. Technically the message would then be just one bit, but the encoding tree that is stored would have to indicate that this one bit is a code for the original string, so we've just trivially increased our message by two bits!

(Constant-length header information, such as the word width or the size of the encoding table, need not be considered in the comparisons, of course.)
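
Concretely, one way I could imagine scoring a candidate word width $k$ (my own rough cost model) is

$$\mathrm{cost}(k) = \sum_{w} n_w \ell_w + (\text{bits needed to store the table of distinct } k\text{-bit words and their codes}),$$

where $n_w$ is the number of occurrences of the $k$-bit word $w$ in $S$ and $\ell_w$ is its Huffman codeword length; the width with the smallest total cost wins.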

VF1

2 Answers


The size required to store the Huffman code table scales like the number of codewords. We expect the number of unique $k$-letter words to be exponential in $k$, in fact roughly $2^{kH}$, where $H$ is the source entropy, though since the file is not infinite, for large $k$ we will actually see fewer. Still, this suggests that for logarithmically large $k$, most of the $k$-letter strings will be almost unique, and so the overall compression for such a $k$ would typically be quite small. In view of that, you can just try several values of $k$ and choose the best one. After you have run some such experiments, you can formulate, and perhaps prove, a hypothesis about the optimal value of $k$ in different situations.
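
A minimal sketch of that experiment, assuming fixed-width $k$-bit source words and charging each candidate $k$ for both the encoded message and a rough table cost of $k + 8$ bits per distinct word (the constants and the toy input are mine, purely for illustration):

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} via the standard Huffman merge
    (only the lengths matter for sizing the output)."""
    if len(freqs) == 1:                      # degenerate one-symbol alphabet
        return {s: 1 for s in freqs}
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        tie += 1
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
    return heap[0][2]

def total_cost_bits(bits, k):
    """Encoded-message bits plus a rough cost for storing the code table."""
    padded = bits + "0" * (-len(bits) % k)   # pad the tail to a multiple of k
    words = [padded[i:i + k] for i in range(0, len(padded), k)]
    freqs = Counter(words)
    lengths = huffman_code_lengths(freqs)
    message = sum(freqs[w] * lengths[w] for w in freqs)
    table = len(freqs) * (k + 8)             # each word plus its codeword length
    return message + table

# Toy input: an ASCII string rendered as a bit string.
text = "the quick brown fox jumps over the lazy dog " * 200
bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))

for k in (1, 2, 4, 8, 16):
    print(f"k={k:2d}  total bits={total_cost_bits(bits, k)}")
```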

Yuval Filmus

I'm having trouble answering your question, for two reasons. First, the entropy changes as you change the alphabet, so the "best" alphabet depends on the correlations between characters in the class of strings that you are trying to encode, not just on the "dyadicness". (This is the problem with the notion of entropy: it depends on your model of what you know about the method by which a string was generated; it is not a fundamental property of a string.) So I can't think of an algorithm that would do better than "try every character length and see which one ends up with the best result."

Second, it's not clear (to me) why you are trying to choose an alphabet that gets closest to having a "dyadic" probability distribution (I had to look that term up), when in practice the main reason people use Huffman coding is that there is an adaptive version (i.e., one that doesn't need to store the encoding) which is "good enough", not because it is optimal. In practice (for example in the old Unix pack utility, or in the Huffman coding done at the end of MPEG encoding), the input alphabet size is chosen to be some "natural" size: bytes if you are encoding Unix files in the 1980s, some much larger alphabet if you know you are encoding Unicode.

If you want a non-adaptive encoding with a fixed-width alphabet (given that you have no reason to expect any correlation between symbols), then you should use arithmetic coding, which gets closer to optimal for non-dyadic distributions.
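
For a concrete sense of the gap: a binary source with $p(0) = 0.9$ and $p(1) = 0.1$ has entropy of about $0.47$ bits per symbol, but Huffman coding over single symbols cannot do better than $1$ bit per symbol; arithmetic coding (or Huffman over longer blocks) gets close to the $0.47$ figure.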

If you have reason to believe that there are correlations between nearby characters, then you might use something like PPM, and if you believe that there are likely to be repeated substrings (not necessarily nearby), you might use some kind of Lempel-Ziv compression. (Various kinds of Lempel-Ziv are used by Unix's gzip and compress.)
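
If you want to see the repeated-substring effect directly, Python's standard zlib module (DEFLATE, i.e., LZ77 plus Huffman coding) makes for a quick toy experiment; the inputs here are just made-up examples:

```python
import os
import zlib

repetitive = b"abcdefgh" * 1000      # long-range repeated substrings
random_ish = os.urandom(8000)        # incompressible by comparison

for name, data in [("repetitive", repetitive), ("random", random_ish)]:
    compressed = zlib.compress(data, 9)   # level 9 = best compression
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes")
```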

Wandering Logic