
So I'm trying to find a method of encryption that not only obfuscates text, but also compresses the result.

For example, if I encrypted ninechars, the ideal result would be less than nine characters.

If that is not possible, is there a method where I can set a limit on the number of characters in the result? In other words, no matter how long the input is, the result will always be 50 characters long.

The closest results that I have found are the Caesar cipher and XOR encryption, but they seem to simply produce the same number of characters as the original input. The others I have found increase the size.

TL;DR I pretty much need a "hash function" that is reversible.

Grant Miller

4 Answers


So I'm trying to find a method of encryption that not only obfuscates text, but also compresses the result. For example, if I encrypted ninechars, the ideal result would be less than nine characters.

Even without encryption, it's not possible for a reversible data compression scheme to shorten all of its inputs. This can be easily proven using the pigeonhole principle: for any given length $\ell$, there are fewer strings of length strictly less than $\ell$ than there are strings of length at most $\ell$. Thus, if the compression scheme maps all strings of length at most $\ell$ to strings of length less than $\ell$, it must necessarily map some distinct input strings to the same output string, and thus be irreversible.
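
For concreteness, here's the count for a binary alphabet (the same argument works for any alphabet size): there are $2^i$ strings of length exactly $i$, so

$$\#\{\text{length} < \ell\} = \sum_{i=0}^{\ell-1} 2^i = 2^\ell - 1 \;<\; 2^{\ell+1} - 1 = \sum_{i=0}^{\ell} 2^i = \#\{\text{length} \le \ell\},$$

and any map from the larger set into the strictly smaller one must send two distinct inputs to the same output.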

(In fact, a similar argument can be used to show a stronger result, namely that, if a lossless compression scheme makes any input string shorter, then it must also make some input string longer. I'll leave proving that as an exercise.)

Otherwise, if this is not possible, is there a method where I can set a limit on the amount of characters in the result? In other words, no matter how long the input is, the result will always be 50 characters long. TL;DR I pretty much need a "hash function" that is reversible.

This is also impossible, by a similar argument. There is a finite, fixed number of strings of length 50 (namely, $k^{50}$ for a $k$-letter alphabet), whereas the number of possible input strings of unbounded length is infinite. Thus, not only does any function mapping arbitrary unbounded inputs to 50-character outputs need to map some distinct inputs to the same output, but it actually has to map infinitely many inputs to some output.

In fact, to show that such a function cannot be reversible, it's enough to consider only inputs that are 51 characters long. Clearly, there are more 51-character inputs than there are 50-character outputs, so some distinct inputs have to map to the same output.
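
What you can have is an irreversible fixed-length fingerprint, i.e. an ordinary hash function. Here's a minimal Python sketch; the truncation to 50 hex characters is just to match the question's example:

```python
import hashlib

def fixed_length_digest(message: bytes) -> str:
    """Map an arbitrary-length input to exactly 50 hex characters.

    This is a one-way fingerprint, not encryption: by the counting
    argument above, infinitely many inputs share each output, so no
    decryption function can exist.
    """
    return hashlib.sha256(message).hexdigest()[:50]

# Arbitrary-length inputs, fixed-length outputs; collisions must
# exist (though actually finding one for SHA-256 is infeasible).
print(fixed_length_digest(b"ninechars"))
print(fixed_length_digest(b"a much, much longer input string" * 1000))
```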


Of course, if you allow some inputs to increase in size, then you can make other inputs shorter. This is basically what ordinary data compression algorithms like LZW do — they rely on the fact that most of the possible inputs to the compressor look essentially like random noise, whereas most of the typical inputs (like plain text, program code, uncompressed image or audio data, etc.) have a lot of repetitive, non-random structure. Such repetitive data can be encoded more compactly, making it a lot shorter; meanwhile, if the input does not happen to be repetitive enough to compress well, the compressor will just insert a short marker (at a minimum, a single bit, but in practice usually a few bytes) indicating that the data could not be compressed, and then include the input verbatim. Thus, the compressor can compress some common types of inputs significantly, at the cost of very slightly increasing the length of the random-looking inputs that make up the bulk of the full input space.
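Here's a minimal sketch of that "compress or store verbatim" trick in Python, using zlib; the one-byte marker format is made up for illustration, not a standard:

```python
import os
import zlib

def compress_or_store(data: bytes) -> bytes:
    """Compress if it helps; otherwise store verbatim.

    The leading byte is a marker: 1 = deflated payload, 0 = raw
    payload. Worst case, the output is one byte longer than the input.
    """
    packed = zlib.compress(data, 9)
    return b"\x01" + packed if len(packed) < len(data) else b"\x00" + data

def expand(blob: bytes) -> bytes:
    marker, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if marker == b"\x01" else payload

repetitive = b"abcabcabc" * 100   # highly redundant: compresses well
random_ish = os.urandom(900)      # looks like noise: stored verbatim

assert expand(compress_or_store(repetitive)) == repetitive
assert expand(compress_or_store(random_ish)) == random_ish
print(len(compress_or_store(repetitive)))  # far below 900
print(len(compress_or_store(random_ish)))  # 901: stored verbatim plus the marker byte
```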

Anyway, none of this really has anything to do with cryptography. If you want to both compress and encrypt data, the standard way is to first compress it, and then encrypt it. (You cannot compress it after encryption, since encrypted data does look random to anyone who doesn't know the key, and has no apparent structure that a compressor could exploit.)
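
As a minimal compress-then-encrypt sketch, assuming the third-party cryptography package for AES-GCM (the key and nonce handling here is illustrative only, not a complete protocol):

```python
import os
import zlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def compress_then_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # Compress first, while the structure is still visible...
    compressed = zlib.compress(plaintext)
    # ...then encrypt; AES-GCM output looks random, so the reverse
    # order would give the compressor nothing to work with.
    nonce = os.urandom(12)  # must be unique per message under a given key
    return nonce + AESGCM(key).encrypt(nonce, compressed, None)

def decrypt_then_decompress(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return zlib.decompress(AESGCM(key).decrypt(nonce, ciphertext, None))

key = AESGCM.generate_key(bit_length=128)
message = b"some highly repetitive plaintext " * 40
blob = compress_then_encrypt(key, message)
assert decrypt_then_decompress(key, blob) == message
print(len(message), "->", len(blob))  # shorter despite the nonce and auth tag
```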

As other answers have pointed out, though, even this standard method has risks. One risk comes from the fact that any encryption scheme that can handle arbitrarily long messages must necessarily reveal at least some information about the length of the message. This may allow an attacker to, say, distinguish the message YES (three characters) from the message NO (two characters).

This is true even without compression, and needs to be kept in mind when designing any cryptosystem, but adding compression to the mix makes it harder to predict and defend against. For example, even if you were careful to always use fixed-length messages (like, say, POSITIVE and NEGATIVE instead of YES and NO), it's quite likely that running such messages through a compressor will produce output whose length varies.
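
For example, with zlib (the two messages are contrived for the demo, but both are exactly 64 bytes):

```python
import zlib

repetitive = b"POSITIVE" * 8                                                   # 64 bytes
varied = b"The answer to your question is: NEGATIVE, as of last Tuesday...."   # 64 bytes

assert len(repetitive) == len(varied) == 64
# Equal plaintext lengths, unequal compressed (hence ciphertext) lengths:
print(len(zlib.compress(repetitive)), len(zlib.compress(varied)))
```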

More importantly, if the attacker can control some of the data being compressed and encrypted, they may be able to learn information about the other parts by observing the length of the compressed message. For example, let's say the attacker can make you generate, compress and encrypt messages of the form TO <ID>: STATUS <POSITIVE/NEGATIVE>, where <ID> can be controlled by the attacker. The attacker may then be able to request two encrypted messages, one for <ID> = POSITIVE and one for <ID> = NEGATIVE, and see which one compresses better.
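
A toy version of that attack in Python, using zlib as a stand-in for the protocol's compressor (the message format and secret are made up; the attacker only ever observes lengths):

```python
import zlib

SECRET_STATUS = "POSITIVE"  # the value the attacker wants to learn

def observed_length(attacker_id: str) -> int:
    # In the real attack, the attacker sees the length of the
    # compressed-then-encrypted message; encryption hides content
    # but not (compressed) length, so plain zlib suffices here.
    message = f"TO {attacker_id}: STATUS {SECRET_STATUS}"
    return len(zlib.compress(message.encode()))

# The guess matching the secret repeats an 8-byte substring, which
# DEFLATE turns into a short back-reference, so it compresses better:
print("guess POSITIVE:", observed_length("POSITIVE"))
print("guess NEGATIVE:", observed_length("NEGATIVE"))
```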

Of course, these are all just toy examples, but similar weaknesses have led to real attacks. For example, as mentioned in otus's answer, the CRIME and BREACH attacks on TLS are based on the same principle as the chosen-plaintext toy attack described in the previous paragraph. Even the passive attack described above, based on just observing naturally occurring message lengths, can be used, for example, to eavesdrop on encrypted VoIP conversations.

That said, the lesson to take home here is not just that data compression is dangerous and should be avoided. Rather, it's that the leakage of plaintext length is dangerous, and, since it cannot be completely avoided, must be kept in mind when analyzing any crypto protocol. Compression is an additional complication on top of that, since it can make the message length depend on its content in non-trivial and non-local ways.

Ilmari Karonen

The way this is usually done is to use a separate compression algorithm, then encrypt the compressed (shorter) message.

However, compression has some disadvantages, and nowadays its use is discouraged. Compression can leak information about the plaintext, as in the CRIME and BREACH attacks on TLS. Arguably it is the protocol combining the compression and encryption that is to blame, but a generic compression algorithm (like anything based on LZ77) allows an attack if the attacker can control part of the message to be encrypted.

Of course, you cannot compress every message, only those with redundancy. And you cannot ensure they will always fit into 50 "characters" (a term you would need to define), unless you are willing to truncate long messages.

otus

If the plaintext is $n$ bits, the ciphertext must be at least $n$ bits; otherwise there would be information loss, and you could not get the plaintext back. In simpler terms, encryption is a randomized mapping from plaintext to ciphertext; if both are $n$ bits, it is essentially a permutation of the $2^n$ possible values. For good security, it is often suggested that the ciphertext be somewhat larger than the plaintext, to make room for enough entropy (e.g., a random IV). If you have for some reason fixed the ciphertext length, the plaintext length can be equal or smaller, but definitely cannot be greater.

On a side note, you cannot compress after encryption either: a good encryption algorithm's goal is to make the ciphertext look as random as possible, which defeats any compression algorithm.

You could compress and then encrypt, but that is considered harmful in certain scenarios, as the CRIME and BREACH attacks discussed above show.

sashank

You can simply approach it in two separate steps: compress the data, then encrypt it. But compression won't do much with short data, like your nine-character example. In fact, that one will probably get longer due to some overhead. Compression needs a reasonable amount of data so that it can find patterns that repeat. Otherwise, the best it can do is trim some bits off your ASCII characters.
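
For example, with zlib (exact numbers may vary with settings):

```python
import zlib

short = b"ninechars"
print(len(short), len(zlib.compress(short)))    # 9 -> ~17: overhead dominates
longer = b"the quick brown fox jumps over the lazy dog " * 50
print(len(longer), len(zlib.compress(longer)))  # repetition compresses well
```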

Regarding, "no matter how long the input is, the result will always be 50 characters long" ... I could write you an encryption program that does that, for free. But you can't afford the decryption utility. Ha!

Seriously, you can't compress data at will, unless you're willing to live with losses. For example, you can compress a JPG pic or an MP3 song to configurable degrees, but it's "lossy"... i.e., you lose data. (The JPG gets blurry/pixelated. The MP3 sounds like a bad robot singing.) It would be interesting if someone has created a lossy text compressor, but I imagine it would just skip some letters and hope you can still read it.