
I read on Wikipedia and in lecture notes that if a lossless data compression algorithm makes a message shorter, it must make another message longer.

For example, this set of notes says:

Consider, for example, the 8 possible 3 bit messages. If one is compressed to two bits, it is not hard to convince yourself that two messages will have to expand to 4 bits, giving an average of 3 1/8 bits.

There must be a gap in my understanding, because I thought I could compress all 3-bit messages this way:

  • Encode: If it starts with a zero, delete the leading zero.
  • Decode: If the message is 3 bits, do nothing. If the message is 2 bits, add a leading zero.
  • Compressed set: 00, 01, 10, 11, 100, 101, 110, 111
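To show what I mean, here is a quick Python sketch of the scheme (the names encode/decode are just my own labels); every 3-bit message taken on its own does round-trip correctly:

    def encode(msg):
        # drop the leading zero of a 3-bit message, if there is one
        return msg[1:] if msg.startswith("0") else msg

    def decode(code):
        # undo encode: a 2-bit code gets its leading zero back
        return code if len(code) == 3 else "0" + code

    # round-trips every single 3-bit message taken on its own
    for i in range(8):
        msg = format(i, "03b")
        assert decode(encode(msg)) == msg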

What am I getting wrong? I am new to CS, so maybe there are some rules or conventions that I've missed.

Gilles 'SO- stop being evil'
Legendre

2 Answers


You are missing an important nuance. How would you know whether the message is only 2 bits, or whether those bits are part of a bigger message? For that, you must also mark where a message starts and where it ends. That marker cannot be a bit, because 0 and 1 are already in use; it has to be a third symbol. If you introduce such a symbol and then re-encode the three-symbol alphabet back into binary, you end up with a code that is longer overall.
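To make this concrete, here is a small Python sketch (the CODEBOOK dictionary and the parses function are only my own illustration): it enumerates every way a received bit string can be split into codewords of the scheme from the question. The string 110100 already has two valid readings, so without a delimiter the receiver cannot tell which messages were sent.

    # codeword -> original 3-bit message, per the scheme in the question
    CODEBOOK = {
        "00": "000", "01": "001", "10": "010", "11": "011",     # leading zero dropped
        "100": "100", "101": "101", "110": "110", "111": "111"  # left unchanged
    }

    def parses(stream):
        """Yield every way to decode `stream` as a sequence of 3-bit messages."""
        if stream == "":
            yield []
            return
        for codeword, message in CODEBOOK.items():
            if stream.startswith(codeword):
                for rest in parses(stream[len(codeword):]):
                    yield [message] + rest

    # "110100" decodes both as ['011', '001', '000'] and as ['110', '100']
    for decoding in parses("110100"):
        print(decoding)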

svick
Shaull

I think you should consider the (binary) messages of length up to a certain value, say $n$. Counting the empty message, there are $2^{n+1}-1$ of them, and if you don't permit "compressed" messages of more than $n$ bits, a lossless compressor must map this set injectively into itself. An injection of a finite set into itself is a bijection, so your compression is a permutation $f$. Now take any word $w$ with $|f(w)| < |w|$ and follow its cycle under $f$: the lengths $|f(x)|$ summed over the cycle equal the lengths $|x|$ summed over the cycle, so the cycle must also contain a word $w'$ with $|f(w')| > |w'|$.
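If you want to see the argument in action, here is a small Python sketch (the code is mine, purely for illustration): it builds all $2^{n+1}-1$ messages for a small $n$, picks an arbitrary permutation $f$ as the "compressor", and checks that every cycle containing a word that shrinks also contains a word that grows.

    import itertools, random

    n = 3
    # all binary strings of length 0..n, including the empty message
    words = [""] + ["".join(bits) for k in range(1, n + 1)
                    for bits in itertools.product("01", repeat=k)]
    assert len(words) == 2 ** (n + 1) - 1

    shuffled = words[:]
    random.shuffle(shuffled)
    f = dict(zip(words, shuffled))  # an arbitrary permutation = a lossless code

    seen = set()
    for start in words:
        if start in seen:
            continue
        # collect the cycle of f through `start`
        cycle, w = [], start
        while w not in seen:
            seen.add(w)
            cycle.append(w)
            w = f[w]
        if any(len(f[w]) < len(w) for w in cycle):         # something got shorter...
            assert any(len(f[w]) > len(w) for w in cycle)  # ...so something got longer
    print("checked", len(words), "words: every shrinking cycle also expands")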

frafl