
I have recently stumbled upon the following interesting article, which claims to compress random data sets, always by more than 50%, regardless of the type and format of the data.

Basically, it uses prime numbers to construct a unique representation of 4-byte data chunks, which is easy to decompress given that every number has a unique prime factorization. To associate these chunks with the primes, it uses a dictionary.
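To make that concrete, here is a rough toy sketch of my reading of the description (my own reconstruction, not code from the paper; the chunking, the prime assignment, and all the names are my guesses):

```python
# Toy reconstruction of my reading of the paper -- NOT the authors' code.

def primes():
    """Yield 2, 3, 5, 7, ... by trial division (fine for a toy example)."""
    found, n = [], 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1

def compress(data: bytes):
    """Assign a prime to each distinct 4-byte chunk, multiply those primes
    together, and keep a dictionary so the chunks can supposedly be
    recovered later from the product."""
    gen, dictionary, product = primes(), {}, 1
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        if chunk not in dictionary:
            dictionary[chunk] = next(gen)
        product *= dictionary[chunk]
    return product, dictionary
```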

My question is:

  • Is this really feasible, as the authors suggest? According to the paper, their results are very efficient and always compress data to a smaller size. Won't the dictionary size be enormous?
  • Couldn't this be used to iteratively re-compress the compressed data using the same algorithm? It is obvious, and has been demonstrated, that such techniques (where the compressed data is re-compressed as many times as possible, dramatically reducing the file size) are impossible; indeed, there would be no bijection between the set of all random data and the compressed data. So why does this feel like it would be possible?
  • Even if the technique is not perfect yet, it can obviously be optimized and greatly improved. Why is this not more widely known/studied? If these claims and experimental results are indeed true, couldn't this revolutionize computing?
Klangen

5 Answers


always compress random data sets by more than 50%

That's impossible. You can't compress random data; you need some structure to take advantage of. Compression must be reversible, so you can't possibly compress everything by 50%, because there are far fewer strings of length at most $n/2$ than there are of length $n$.
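To spell the counting out: a scheme that always shrank $n$-bit inputs to at most $n/2$ bits would have to squeeze $2^n$ possible inputs into far fewer possible outputs, since

$$\sum_{k=0}^{\lfloor n/2\rfloor} 2^k \;=\; 2^{\lfloor n/2\rfloor+1}-1 \;<\; 2^n \quad\text{for every } n\ge 1,$$

so by the pigeonhole principle two different inputs would have to share the same compressed output, and no decompressor could recover both.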

There are some major issues with the paper:

  • They use 10 test files without any indication of their content. Is the data really random? How were they generated?

  • They claim to achieve compression ratios of at least 50%, while their test data shows they achieve at most 50%.

This algorithm defines a lossless strategy which makes use of the prime numbers present in the decimal number system

  • What? Prime numbers are prime numbers regardless of the base.

  • Issue #1 with decompression: prime factorization is a hard problem; how do they do it efficiently?

  • Issue #2 with decompression (this is the kicker): they multiply the prime numbers together, but in doing so you lose all information about the order, since $2\cdot 5 = 10 = 5\cdot 2$ (a tiny demonstration follows this list). I don't think it is possible to decompress at all using their technique.
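Here is the order problem in a few lines (my own sketch with a made-up chunk-to-prime dictionary, not anything from the paper): two different inputs produce the same product, so no decoder can tell them apart.

```python
from math import prod

# Hypothetical chunk -> prime dictionary; any assignment has the same problem.
prime_for = {b"AAAA": 2, b"BBBB": 3, b"CCCC": 5}

def encode(chunks):
    """Multiply the primes of the chunks together, as the paper seems to do."""
    return prod(prime_for[c] for c in chunks)

print(encode([b"AAAA", b"BBBB", b"CCCC"]))  # 30
print(encode([b"CCCC", b"BBBB", b"AAAA"]))  # 30 -- same output, different input
```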

I don't think this paper is very good.

Tom van der Zanden

I'm going to defer to Tom van der Zanden, who seems to have read the paper and discovered a weakness in the method. While I didn't read the paper in detail, going by the abstract and the results table, the claim seems broadly believable.

What they claim is a consistent 50% compression ratio on text files (not "all files"), which they note is around the same as LZW and about 10% worse than (presumably zero-order) Huffman coding. Compressing text files by 50% is not hard to achieve using reasonably simple methods; it's an undergraduate assignment in many computer science courses.
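If you want to sanity-check that figure, any off-the-shelf general-purpose compressor lands in that ballpark (or better) on ordinary English prose; a quick sketch with zlib, where the filename is just a placeholder for whatever text you have lying around:

```python
import zlib

# Any reasonably long English text file will do; the exact ratio varies by input.
with open("some_english_text.txt", "rb") as f:   # placeholder filename
    text = f.read()

compressed = zlib.compress(text, 9)
print(f"original:   {len(text):>8} bytes")
print(f"compressed: {len(compressed):>8} bytes "
      f"({100 * len(compressed) / len(text):.1f}% of the original)")
```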

I do agree that the paper isn't very good as published research, and I don't think it speaks well of the reviewers that this was accepted. Apart from the obvious missing details that make the results impossible to reproduce (e.g. what the text files were), and no attempt to tie it into the field of compression, there is no sense that they really understand what their algorithm is doing.

The conference web site claims a 1:4 acceptance ratio, which makes you wonder what they rejected.

Pseudonym

You ask:

  • Is this really feasible, as the authors suggest? According to the paper, their results are very efficient and always compress data to a smaller size. Won't the dictionary size be enormous?

Yes, of course. Even for their hand-picked example ("THE QUICK SILVER FOX JUMPS OVER THE LAZY DOG"), they don't achieve compression, because the dictionary contains every 4-byte chunk of the text (minus 4 bytes for the one repetition of "THE")... and the "compressed" version of the text has to include the whole dictionary plus all this prime number crap.
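You can check the arithmetic yourself: split the pangram into 4-byte chunks and only one of the eleven ("THE ") repeats, so the dictionary alone already holds 40 of the original 44 bytes before you store the product or anything else (a quick sketch; the paper doesn't spell out its exact dictionary layout):

```python
text = b"THE QUICK SILVER FOX JUMPS OVER THE LAZY DOG"     # 44 bytes
chunks = [text[i:i + 4] for i in range(0, len(text), 4)]   # 11 chunks of 4 bytes
unique = set(chunks)                                        # 10 distinct chunks
print(len(text), "bytes of input")
print(4 * len(unique), "bytes of chunk data in the dictionary alone")
```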

  • Couldn't this be used to iteratively re-compress the compressed data using the same algorithm? It is obvious, and has been demonstrated, that such techniques (where the compressed data is re-compressed as many times as possible, dramatically reducing the file size) are impossible; indeed, there would be no bijection between the set of all random data and the compressed data. So why does this feel like it would be possible?

Again you seem to have a good intuitive grasp of the situation. You have intuitively realized that no compression scheme can ever be effective on all inputs, because if it were, we could just apply it over and over to compress any input down to a single bit — and then to nothingness!

To put it another way: Once you've compressed all your .wav files to .mp3, you're not going to get any improvement in file size by zipping them. If your MP3 compressor has done its job, there won't be any patterns left for the ZIP compressor to exploit.
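You can watch this happen with any general-purpose compressor. In the sketch below, random bytes stand in for data that a good compressor has already squeezed dry: each further pass has no patterns left to exploit and only adds overhead.

```python
import os
import zlib

data = os.urandom(1 << 20)               # 1 MiB with no structure to exploit
once = zlib.compress(data, 9)
twice = zlib.compress(once, 9)
print(len(data), len(once), len(twice))  # each pass comes out slightly LARGER
```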

(The same applies to encryption: if I take a file of zeroes and encrypt it according to my cryptographic algorithm of choice, the resulting file had better not be compressible, or else my encryption algorithm is leaking "pattern" into its output!)
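(A toy version of that experiment, using SHA-256 in counter mode as a stand-in keystream purely for illustration and emphatically not as real cryptography: the all-zero file compresses to almost nothing, while its "ciphertext" doesn't compress at all.)

```python
import hashlib
import zlib

zeros = bytes(1 << 16)                    # 64 KiB of zero bytes
key = b"my secret key"                    # made-up key, illustration only

# Toy keystream: SHA-256(key || counter) blocks -- NOT a real cipher.
keystream = b"".join(
    hashlib.sha256(key + i.to_bytes(8, "big")).digest()
    for i in range(len(zeros) // 32)
)
ciphertext = bytes(a ^ b for a, b in zip(zeros, keystream))

print(len(zlib.compress(zeros, 9)))       # tiny: zeroes are pure "pattern"
print(len(zlib.compress(ciphertext, 9)))  # about 64 KiB: no pattern left to find
```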

  • Even if the technique is not perfect yet, it can obviously be optimized and greatly improved. Why is this not more widely known/studied? If these claims and experimental results are indeed true, couldn't this revolutionize computing?

These claims and experimental results are not true.

As Tom van der Zanden already noted, the "compression algorithm" of Chakraborty, Kar, and Guchait is flawed: not only does it fail to achieve any compression, it is also irreversible (in mathspeak, "not injective"): there are a multitude of texts that all "compress" to the same image, because their algorithm is basically multiplication, and multiplication is commutative.

You should feel good that your intuitive understanding of these concepts led you to the right conclusion instantly. And, if you can spare the time, you should feel pity for the authors of the paper who clearly spent a lot of time thinking about the topic without understanding it at all.

The file directory one level above the URL you posted contains 139 "papers" of this same quality, all apparently accepted into the "Proceedings of the International Conference on Emerging Research in Computing, Information, Communication and Applications." This appears to be a sham conference of the usual type. The purpose of such conferences is to allow fraudulent academics to claim "publication in a journal", while also allowing unscrupulous organizers to make a ton of money. (For more on fake conferences, check out this reddit thread or various StackExchange posts on the subject.) Sham conferences exist in every field. Just learn to trust your instincts and not believe everything you read in a "conference proceeding", and you'll do fine.

Quuxplusone

Entropy effectively bounds the performance of the strongest possible lossless compression. Thus there exists no algorithm that can always compress random data sets by more than 50%.
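As a rough empirical illustration (this measures single-byte Shannon entropy only; the real bound involves the source's entropy rate): uniformly random bytes carry close to 8 bits of entropy per byte, so there is no redundancy for any lossless compressor to remove, while English-like text sits far below that.

```python
import math
import os
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy of the byte distribution, in bits per byte."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

print(bits_per_byte(os.urandom(1 << 20)))             # ~8.0: nothing to squeeze out
print(bits_per_byte(b"to be or not to be, " * 4096))  # far below 8: lots of redundancy
```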

J.-E. Pin

Compression methods that are reversible generally find a pattern and re-express it in a more compact way. Some are very clever, some very simple. At some point there is no pattern left: the process has 'boiled' the data set down to its simplest unique form, and any attempt at further compression from that point on results in a larger data set or loses information. In "magic number" compression schemes there is always a flaw, a sleight of hand, or a loss. Be wary of any process that claims to outdo the latest WinZip or RAR.

SkipBerne