
I read on Wikipedia and in lecture notes that if a lossless data compression algorithm makes a message shorter, it must make another message longer.

For example, this set of notes says:

Consider, for example, the 8 possible 3 bit messages. If one is compressed to two bits, it is not hard to convince yourself that two messages will have to expand to 4 bits, giving an average of 3 1/8 bits.

There must be a gap in my understanding, because I thought I could compress all 3-bit messages this way:

  • Encode: If it starts with a zero, delete the leading zero.
  • Decode: If the message is 3 bits, do nothing. If the message is 2 bits, add a leading zero.
  • Compressed set: 00, 01, 10, 11, 100, 101, 110, 111
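To show what I mean, here is a quick Python sketch of the scheme (the names encode/decode are just my own labels); every 3-bit message taken on its own does round-trip correctly:

    def encode(msg):
        # drop the leading zero of a 3-bit message, if there is one
        return msg[1:] if msg.startswith("0") else msg

    def decode(code):
        # undo encode: a 2-bit code gets its leading zero back
        return code if len(code) == 3 else "0" + code

    # round-trips every single 3-bit message taken on its own
    for i in range(8):
        msg = format(i, "03b")
        assert decode(encode(msg)) == msg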

What am I getting wrong? I am new to CS, so maybe there are some rules or conventions that I've missed.

Gilles 'SO- stop being evil'
Legendre

2 Answers


You are missing an important nuance. How would you know whether the message is only 2 bits, or whether those bits are part of a bigger message? For that, you must also mark where a message starts and where it ends. That marker cannot be a bit, because 0 and 1 are already in use; it has to be a third symbol. If you introduce such a symbol and then re-encode the three-symbol alphabet back into binary, you end up with a code that is longer overall.
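To make this concrete, here is a small Python sketch (the CODEBOOK dictionary and the parses function are only my own illustration): it enumerates every way a received bit string can be split into codewords of the scheme from the question. The string 110100 already has two valid readings, so without a delimiter the receiver cannot tell which messages were sent.

    # codeword -> original 3-bit message, per the scheme in the question
    CODEBOOK = {
        "00": "000", "01": "001", "10": "010", "11": "011",     # leading zero dropped
        "100": "100", "101": "101", "110": "110", "111": "111"  # left unchanged
    }

    def parses(stream):
        """Yield every way to decode `stream` as a sequence of 3-bit messages."""
        if stream == "":
            yield []
            return
        for codeword, message in CODEBOOK.items():
            if stream.startswith(codeword):
                for rest in parses(stream[len(codeword):]):
                    yield [message] + rest

    # "110100" decodes both as ['011', '001', '000'] and as ['110', '100']
    for decoding in parses("110100"):
        print(decoding)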

svick
Shaull

I think you should consider the (binary) messages of length up to a certain value, say $n$. Counting the empty message, there are $2^{n+1}-1$ of them, and if you don't permit "compressed" messages of more than $n$ bits, a lossless compressor must map this set injectively into itself. An injection of a finite set into itself is a bijection, so your compression is a permutation $f$. Now take any word $w$ with $|f(w)| < |w|$ and follow its cycle under $f$: the lengths $|f(x)|$ summed over the cycle equal the lengths $|x|$ summed over the cycle, so the cycle must also contain a word $w'$ with $|f(w')| > |w'|$.
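If you want to see the argument in action, here is a small Python sketch (the code is mine, purely for illustration): it builds all $2^{n+1}-1$ messages for a small $n$, picks an arbitrary permutation $f$ as the "compressor", and checks that every cycle containing a word that shrinks also contains a word that grows.

    import itertools, random

    n = 3
    # all binary strings of length 0..n, including the empty message
    words = [""] + ["".join(bits) for k in range(1, n + 1)
                    for bits in itertools.product("01", repeat=k)]
    assert len(words) == 2 ** (n + 1) - 1

    shuffled = words[:]
    random.shuffle(shuffled)
    f = dict(zip(words, shuffled))  # an arbitrary permutation = a lossless code

    seen = set()
    for start in words:
        if start in seen:
            continue
        # collect the cycle of f through `start`
        cycle, w = [], start
        while w not in seen:
            seen.add(w)
            cycle.append(w)
            w = f[w]
        if any(len(f[w]) < len(w) for w in cycle):         # something got shorter...
            assert any(len(f[w]) > len(w) for w in cycle)  # ...so something got longer
    print("checked", len(words), "words: every shrinking cycle also expands")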

frafl