
I have the following encoding: A = 0, B = 10, C = 11. The symbol probabilities are $P(A)=1/2$, $P(B)=1/3$, $P(C)=1/6$. I calculated the average length (in bits) per symbol of this encoding as follows: $$\tfrac12\cdot 1 + \tfrac13\cdot 2 + \tfrac16\cdot 2 = 1.5$$ The book asks whether it is possible to achieve $\sqrt{2} \approx 1.4$. I thought about it, but making the codewords longer would only raise the average, and making them shorter would mix codewords together. For instance, after such a change I would get lost decoding $011100100$, which currently decodes to $ACBABA$.
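A quick way to check both the average and the decoding (a small Python sketch using only the standard library; the code table and probabilities are the ones above):

```python
from fractions import Fraction as F

# Prefix code and probabilities from the question.
code = {"A": "0", "B": "10", "C": "11"}
prob = {"A": F(1, 2), "B": F(1, 3), "C": F(1, 6)}

# Average number of bits per symbol: sum of P(s) * len(code[s]).
avg = sum(prob[s] * len(code[s]) for s in code)
print(float(avg))  # 1.5

# Greedy decoding works because the code is prefix-free.
def decode(bits):
    inv = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)

print(decode("011100100"))  # ACBABA
```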

Is it possible, or am I right in saying it's not? If it's not, how can I argue that it isn't?

user21820
  • See if this helps you: https://math.stackexchange.com/questions/730206/how-to-make-the-encoding-of-symbols-needs-only-1-58496-bits-symbol-as-carried-ou?rq=1 –  Dec 19 '17 at 08:39
  • @Rohan I just checked it before posting, it didn't :/ –  Dec 19 '17 at 08:39
  • I tried this way: https://math.stackexchange.com/a/730246/481197 but I ended up with a higher answer, namely 1.6. –  Dec 19 '17 at 08:46
  • @AbdulMalekAltawekji Which book are you reading and why? Seems strange that you would be solving such a problem without knowing how to apply entropy to the problem. – JiK Dec 19 '17 at 12:35
  • @JiK I'm reading Thomas H. Cormen's *Algorithms Unlocked*. It's one of the exercises; the book explains how to apply entropy but doesn't provide the answers so I can check mine. –  Dec 19 '17 at 12:39
  • @AbdulMalekAltawekji You should include your work in the question, then, and explain which specific part in your solution you are doubting. – JiK Dec 19 '17 at 13:59

2 Answers


Given a finite probability distribution $p:=(p_i)_{1\leq i\leq n}$, its entropy is defined by $$H(p):=-\sum_{i=1}^n p_i \log_2(p_i)\ .$$ If $p$ models the frequencies of the letters of an alphabet, then $H(p)$ turns out to be the minimal achievable average number of bits per letter. This is the essential content of Shannon theory, and cannot be explained in a few lines. In the case $p=\bigl({1\over2},{1\over3},{1\over6}\bigr)$ one obtains $H(p)=1.45915$. This is what you can reach "in the limit" through a clever encoding. But $1.41421$ is definitely not attainable under the given circumstances.
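For a quick numerical check, a short Python sketch reproduces this value:

```python
from math import log2

p = [1/2, 1/3, 1/6]
H = -sum(pi * log2(pi) for pi in p)
print(round(H, 5))  # 1.45915
```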

  • Thanks for your answer! Can you tell me how I would be able to achieve 1.45915? –  Dec 19 '17 at 09:40

The Huffman code is the best you can achieve for encoding single symbols from a given set. To achieve a better encoding, you must encode combinations of several symbols at once.

For example, for two-symbol combinations, you get the probabilities: $$\begin{aligned} p(AA) &= \frac14 & p(AB) &= \frac16 & p(AC) &= \frac1{12}\\ p(BA) &= \frac16 & p(BB) &= \frac19 & p(BC) &= \frac1{18}\\ p(CA) &= \frac1{12} & p(CB) &= \frac1{18} & p(CC) &= \frac1{36} \end{aligned}$$

Applying the Huffman code to this, you can get (e.g. using this tool): $$\begin{aligned} AA &\to 10 & AB &\to 111 & AC &\to 1100\\ BA &\to 00 & BB &\to 010 & BC &\to 01111\\ CA &\to 1101 & CB &\to 0110 & CC &\to 01110 \end{aligned}$$ The average length with this encoding is $$\frac12\left(\frac{2}{4} + \frac{3}{6} + \frac{4}{12} + \frac{2}{6} + \frac{3}{9} + \frac{5}{18} + \frac{4}{12} + \frac{4}{18} + \frac{5}{36}\right) \approx 1.486$$ which is already less than $1.5$.
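As a quick check of that figure, a short Python sketch recomputes the average from the pair probabilities and the codeword lengths above:

```python
from fractions import Fraction as F

# (probability, codeword length) for each two-symbol block, taken from the tables above.
pairs = {
    "AA": (F(1, 4), 2),  "AB": (F(1, 6), 3),  "AC": (F(1, 12), 4),
    "BA": (F(1, 6), 2),  "BB": (F(1, 9), 3),  "BC": (F(1, 18), 5),
    "CA": (F(1, 12), 4), "CB": (F(1, 18), 4), "CC": (F(1, 36), 5),
}

bits_per_pair = sum(p * length for p, length in pairs.values())
print(float(bits_per_pair / 2))  # 1.4861... bits per original symbol
```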

Encoding even more characters at one time, you get even closer to the theoretical optimum, $-\sum_k p_k \log_2 p_k \approx 1.46$.
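A sketch of this idea in Python, using a standard heapq-based Huffman construction over blocks of $n$ symbols (an illustration, not the tool used above):

```python
import heapq
from itertools import product
from math import log2

probs = {"A": 1/2, "B": 1/3, "C": 1/6}

def huffman_lengths(dist):
    """Codeword lengths of an optimal Huffman code for the distribution dist."""
    # Heap entries: (probability, tie-break id, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(dist.items())]
    heapq.heapify(heap)
    length = dict.fromkeys(dist, 0)
    uid = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # each merge adds one bit to these codewords
            length[s] += 1
        heapq.heappush(heap, (p1 + p2, uid, s1 + s2))
        uid += 1
    return length

for n in (1, 2, 3, 4):
    # Distribution of blocks of n independent symbols.
    blocks = {"".join(t): 1.0 for t in product(probs, repeat=n)}
    for b in blocks:
        for s in b:
            blocks[b] *= probs[s]
    length = huffman_lengths(blocks)
    avg = sum(blocks[b] * length[b] for b in blocks) / n
    print(n, round(avg, 4))   # per-symbol average: 1.5 for n=1, ~1.4861 for n=2, ...

# Since H <= E[codeword length]/n < H + 1/n for blocks of n i.i.d. symbols,
# the per-symbol average tends to H as n grows.
print(round(-sum(p * log2(p) for p in probs.values()), 4))  # H = 1.4591
```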

celtschk
  • Arithmetic coding, for example, is asymptotically better than 2-symbol Huffman coding, but I'm not sure it is described by "you must encode combinations of several symbols at once". – JiK Dec 19 '17 at 12:34
  • In a "real world" application, unless you know your symbols always come in pairs, presumably you'd have to add 3 more symbols for "Terminal A", "Terminal B" and "Terminal C", which would slightly increase the average? – TripeHound Dec 19 '17 at 13:39
  • @JiK, well, doesn't arithmetic coding in a sense encode all the symbols as one block... – ilkkachu Dec 19 '17 at 13:43
  • @JiK: Arithmetic coding definitely doesn't encode the symbols separately. For one, you cannot generally select any bit and point to a single symbol to whose encoding it belongs. Also, you cannot implement it as a function that takes one symbol and spits out a sequence of bits, without storing extra information between calls, and code that calls that function for each symbol in turn. – celtschk Dec 19 '17 at 20:13
  • @TripeHound: In the real world, there's usually the length of the original file stored somewhere. But even if not, changing the encoding to contain extra codes of which just one is used just once in the entire file would be wasteful. Instead, extend an odd-length input with the most common symbol (in this case, $A$), and add a single extra bit at the end of the data which tells whether the last encoded symbol shall be discarded. As an optimization, only add the extra bit if the last encoded symbol is $A$ (as otherwise it must be real anyway). – celtschk Dec 19 '17 at 20:16