9

Apparently current best practices recommend that you do not compress before you encrypt.

For example, in this blog entry:

http://sockpuppet.org/blog/2013/07/22/applied-practical-cryptography/

It is written:

Developers shouldn’t compress plaintext before encrypting.

Or in this question here on crypto.se:

Is compressing data prior to encryption necessary to reduce plaintext redundancy?

several people answer that compressing before encryption is actually harmful.

What I really don't understand is the explanation: apparently, the compression algorithms used leak information about the size of the plaintext input!?

But how does encrypting without compression not also leak information about the length of the input?

What additional information is leaked by compression algorithms that wouldn't be leaked without compression?

If padding is used on plaintext encrypted without compression, couldn't padding also be used before encrypting compressed data (or data that is to be compressed)?

Also: is it really a given that the attacks made possible by compressing before encryption are more of a problem than the attacks made possible by not compressing? (The latter were the reason why, for more than two decades, the standard advice was to compress before encrypting.)

Cedric Martin
  • 455
  • 4
  • 11

2 Answers

8

The problem is not with compression and encryption, it is with the protocol that is being used, and the type of data being compressed (or not) prior to encryption.

The most damning leaks are on protocols that were either designed to be compressed without encryption, or encrypted without compression.

The best example I have is VoIP systems that use variable-bitrate compression prior to encryption. Since the gaps between words and even syllables are highly compressed, and the words and syllables themselves are compressed at different rates depending on their content, traffic-pattern analysis can be used to accurately detect spoken words and phrases. Public research only began within the last decade, but NSA-designed encrypted voice protocols used constant-bitrate compression and I/O clocking, meaning the technique was either already known or they expected it to be exploited.
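As a toy sketch of why this works (the frame sizes are invented and the "cipher" is just a length-preserving XOR with a random keystream, not a real codec or cipher), note that encrypting variable-bitrate output leaves the size pattern visible on the wire:

```python
# Toy illustration: the frame sizes are made up and the "cipher" is just a
# length-preserving XOR with a random keystream, but it shows why encrypting
# variable-bitrate output leaves the size pattern visible on the wire.
import os

def encrypt_frame(frame: bytes) -> bytes:
    """Stand-in for a length-preserving stream cipher."""
    keystream = os.urandom(len(frame))
    return bytes(a ^ b for a, b in zip(frame, keystream))

# Hypothetical VBR codec output: frame sizes depend on what was spoken.
phrase_a = [b"\x00" * n for n in (12, 12, 40, 55, 12, 48)]
phrase_b = [b"\x00" * n for n in (12, 50, 33, 12, 12, 60)]

sizes_a = [len(encrypt_frame(f)) for f in phrase_a]
sizes_b = [len(encrypt_frame(f)) for f in phrase_b]

# An eavesdropper never sees the plaintext, yet the two size sequences alone
# distinguish the utterances -- the basis of the traffic-analysis attacks.
print(sizes_a)  # [12, 12, 40, 55, 12, 48]
print(sizes_b)  # [12, 50, 33, 12, 12, 60]
```

A constant-bitrate codec with fixed clocking makes every frame the same size, which is exactly what removes this signal.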

An excellent presentation of the subject can be found here, as well as a more detailed initial report from 2008.

When a protocol is properly designed with both compression and encryption in mind, it is much more difficult to gain information. Take, for example, an SMS scheme where the text is first compressed and then the entire message space is encrypted using AES-GCM, potentially allowing more than 160 characters of text to fit in a single message. Part of the SMS payload must be used for the nonce and authentication tag, so the more message text that fits in the remainder, the better. (This is of course an additional layer of encryption I am talking about, not part of the SMS protocol itself.) Sending multiple sequential SMS messages reveals information about the message size, namely that it is larger than the single-message limit, and compression can prevent that if done correctly.
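A minimal sketch of that compress-then-encrypt idea (not an actual SMS scheme; it assumes the third-party `cryptography` package, the message text is made up, and 140 bytes is used as the single-message payload budget):

```python
# Minimal compress-then-encrypt sketch, not an actual SMS scheme.
# Assumes the third-party "cryptography" package; the message text is made up.
import os
import zlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_LEN = 12   # 96-bit nonce, the usual choice for GCM
# AESGCM.encrypt() appends a 16-byte authentication tag to the ciphertext.

key = AESGCM.generate_key(bit_length=128)
text = ("meet me at the usual place at nine, bring the documents, "
        "and please do not mention any of this to anyone else").encode()

compressed = zlib.compress(text, 9)
nonce = os.urandom(NONCE_LEN)
ciphertext = AESGCM(key).encrypt(nonce, compressed, None)

payload = nonce + ciphertext          # what would actually be sent
print(len(text), len(compressed), len(payload))
# The nonce and tag cost 28 bytes; compressing the text is what can buy that
# back and keep the whole payload within a single 140-byte message budget.
```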

Padding any compressed plaintext to a multiple of some given number (say 128 bytes) prior to encryption makes certain attacks less effective, especially if the majority of plaintexts are small (like text messages). The methods used to pad the data are also important, as there are attacks that exploit padding used in various protocols.
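A sketch of what such padding might look like, assuming an ISO/IEC 7816-4 style scheme (a 0x80 marker followed by zero bytes) and a 128-byte bucket; the real scheme and bucket size would be dictated by the protocol:

```python
# Sketch of length padding before encryption, assuming an ISO/IEC 7816-4 style
# scheme (a 0x80 marker followed by zero bytes) and a 128-byte bucket size.
BUCKET = 128

def pad(data: bytes, bucket: int = BUCKET) -> bytes:
    """Pad to the next multiple of `bucket`, always adding at least one byte."""
    pad_len = bucket - (len(data) % bucket)
    return data + b"\x80" + b"\x00" * (pad_len - 1)

def unpad(padded: bytes) -> bytes:
    """Strip the padding; raise if the marker is missing or followed by junk."""
    idx = padded.rindex(b"\x80")
    if any(padded[idx + 1:]):
        raise ValueError("invalid padding")
    return padded[:idx]

for msg in (b"ok", b"see you at 8", b"x" * 200):
    padded = pad(msg)
    assert len(padded) % BUCKET == 0 and unpad(padded) == msg
    print(len(msg), "->", len(padded))   # 2 -> 128, 12 -> 128, 200 -> 256
```

After padding, a 2-byte acknowledgement and a 120-byte note both encrypt to the same-sized ciphertext, which is what blunts the length leak for small messages.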

Richie Frame
  • 13,278
  • 1
  • 26
  • 42
4

Read about the CRIME and BREACH attacks. They are the classic examples of how compression before encryption can leak information about the input: the length of the compressed data reveals information about the contents of the data itself.
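A toy demonstration of the principle (the secret cookie value and the `sessionid=` prefix below are made up): when attacker-controlled input is compressed in the same context as a secret, the length of the compressed output, which encryption does not hide, tells the attacker which guess matched.

```python
# Toy demonstration of the CRIME/BREACH principle; the secret cookie value and
# the "sessionid=" prefix are made up. The attacker only observes the length
# of the compressed (then encrypted) output.
import zlib

SECRET = b"sessionid=7f3a9c1b"   # hypothetical secret included in every response

def observed_length(attacker_input: bytes) -> int:
    # attacker-controlled input compressed in the same context as the secret
    return len(zlib.compress(attacker_input + SECRET, 9))

# Guess the character after "sessionid=": a correct guess extends an LZ77 match
# into the secret, so it typically compresses to a shorter output.
lengths = {c: observed_length(b"sessionid=" + c.encode()) for c in "0123456789abcdef"}
best = min(lengths.values())
print([c for c, n in lengths.items() if n == best])   # '7' should be among these
```

Repeating this character by character recovers the whole secret, which is essentially what CRIME does against compressed TLS requests.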

See also https://security.stackexchange.com/q/19911/971 and https://security.stackexchange.com/q/20406/971 and https://security.stackexchange.com/q/39925/971 for detailed discussion of those attacks over on IT Security.SE.

D.W.
  • 36,982
  • 13
  • 107
  • 196