Trying to learn more about compression techniques, I found something in the Wikipedia article on arithmetic coding that I'm not sure I fully grok. In describing how Huffman coding can sometimes be inefficient, the author refers to an 'optimal compression ratio' that seems to be a function of the probability of each symbol appearing at any given position in the dataset. Am I correct in understanding this to mean:
Given a set of data and a set of probabilities describing the likelihood of any given member of the dataset being a particular symbol, there is no way to represent that data, encoded under those probabilities, in fewer bits than the calculated optimum?
In other words: there's no way to encode something in fewer than 'optimal' bits, so don't try?
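
To make sure I'm even talking about the right quantity, here's a quick sketch of what I *think* the article means by the optimal figure, namely the Shannon entropy of the symbol probabilities. The symbols and probabilities below are made up by me purely for illustration:

```python
# Minimal sketch (my own illustration, not from the article): the "optimal"
# number of bits per symbol is the Shannon entropy of the assumed symbol
# probabilities, H = -sum(p_i * log2(p_i)).
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.25}  # hypothetical symbol probabilities

# Entropy: the average number of bits per symbol that any lossless code
# needs on average, given these probabilities.
entropy = -sum(p * math.log2(p) for p in probs.values())

print(f"entropy = {entropy:.3f} bits/symbol")  # 1.500 for this distribution

# A Huffman code for this particular distribution happens to hit the bound
# exactly (codeword lengths 1, 2, 2), but for a skewed distribution like
# {0.99, 0.01} Huffman must still spend a whole bit per symbol, while the
# entropy is only about 0.08 bits, which is where the inefficiency comes in.
```

If that's the quantity the article is talking about, my question is whether it really is a hard lower bound.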