22

I just came across the following: I put multiple identical copies of a PNG image into a folder and then tried to compress that folder with the following methods:

  • tar czf folder.tar.gz folder/
  • tar cf folder.tar folder/ && xz --stdout folder.tar > folder.tar.xz (this one works well for identical images; for merely similar images, however, the gain is zero)
  • zip -r folder.zip folder/

When I checked the sizes of the .tar.gz, .tar.xz and .zip files, I realized that each is almost the same size as folder/ itself.
I understand that a PNG image itself may already be highly compressed and therefore cannot be compressed much further. However, when merging many similar (in this case even identical) PNG images into an archive and then compressing the archive, I would expect the required size to decrease markedly. In the case of identical images I would expect a size of roughly that of a single image.

a_guest

7 Answers

34

Have a look at how compression algorithms work. At least those in the Lempel-Ziv family (gzip uses LZ77, zip apparently mostly does as well, and xz uses LZMA) compress somewhat locally: similarities that lie far away from each other cannot be identified.

The details differ between the methods, but the bottom line is that by the time the algorithm reaches the second image, it has already "forgotten" the beginning of the first. And so on.

You can try to manually change the parameters of the compression method; if the window size (LZ77) or the block/dictionary size (later methods) is at least as large as two images, you will probably see further compression.
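For example, with xz you can ask for a much larger dictionary than the default (gzip's DEFLATE window is fixed at 32 KiB and cannot be enlarged). A minimal sketch, assuming the folder.tar from your second command and an illustrative 192 MiB dictionary:

    xz --keep --lzma2=preset=9e,dict=192MiB folder.tar    # writes folder.tar.xz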


Note that the above only really applies if you have identical images or almost identical uncompressed images. If there are differences between the images, the compressed images may not look anything alike in memory. I don't know the details of how PNG compression works; you may want to check the hex representations of your images for shared substrings manually.

Also note that even with changed parameters and redundancy to exploit, you won't get down to the size of one image. Larger dictionaries mean larger code-word size, and even if two images are exactly identical you may have to encode the second one using multiple code-words (which point into the first).

Raphael
24

Why this happens. There are actually two different effects happening here:

  • Each file is compressed independently. Some archive programs -- including zip -- compress each file independently, with no memory carried from one file to the next. In other words, each file is separately compressed, then the compressed files are concatenated into an archive.

  • Short-term memory. Some archive programs can use information about one file to help compress the next file better. They effectively concatenate the files, then compress the result. This is an improvement.

    See also Nayuki's answer for more discussion of this.

    However, there's a second problem. Some compression schemes -- including zip, gzip, and bzip2 -- have a limited memory. They compress the data on-the-fly, remembering the past 32KB of data, but they don't remember anything about data that occurred much earlier in the file. In other words, they can't find duplicated data if the duplicates occur more than 32KB apart. As a result, if the identical files are short (shorter than about 32KB), the compression algorithm can remove the duplicated data, but if the identical files are long, the compression algorithm gets hosed and becomes worthless: it can't detect any of the duplication in your data. (bzip2 remembers the past 900KB or so of data, instead of 32KB.)

    All standard compression algorithms have some maximum memory size, beyond which they fail to detect patterns... but for some, this number is much larger than others. For bzip2, it's something like 900KB. For xz, it's something like 8MB (with default settings). For 7z, it's something like 2GB. 2GB is more than large enough to recognize the duplicated copies of PNG files (which are typically far smaller than 2GB). Additionally, 7z also tries to be clever about placing files that are likely to be similar to each other next to each other in the archive, to help the compressor work better; tar doesn't do anything like that.

    See also Raphael's answer and Nayuki's answer for more explanation of this effect.

How this applies to your setting. For your specific example, you are working with PNG images. PNG images are themselves compressed, so you can think of each PNG file as basically a sequence of random-looking bytes, with no patterns or duplication within the file. There's nothing for a compressor to exploit, if it looks at a single PNG image. Thus, if you try to compress a single PNG file (or create a zip/tar/... archive containing just a single PNG file), you won't get any compression.
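(You can check this quickly on one of your files; image.png is just a placeholder name, and gzip -k keeps the original around.)

    gzip -k image.png
    ls -l image.png image.png.gz    # the .gz is about the same size, sometimes slightly larger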

Now let's look at what happens if you try to store multiple copies of the same PNG file:

  • Small files. If the PNG file is very small, then everything except for zip will work great. Zip will fail spectacularly: it compresses each file independently, so it has no chance to detect the redundancy/duplication among the files. Moreover, as it tries to compress each PNG file, it achieves no compression; the size of a zip archive will be huge. In contrast, the size of a tar archive (whether compressed with gzip, bzip2, or xz) and a 7z archive will be small, as it basically stores one copy of the file and then notices that the others are all identical -- they benefit from retaining memory from one file to another.

  • Large files. If the PNG file is large, then only 7z works well. In particular, zip continues to fail spectacularly. Also, tar.gz and tar.bz2 fail badly, as the size of the file is larger than the compressor's memory window: as the compressor sees the first copy of the file, it can't shrink it (since it has already been compressed); by the time it starts to see the beginning of the second copy of the file, it has already forgotten the byte sequences seen at the beginning of the first file and can't make the connection that this data is actually a duplicate.

    In contrast, tar.xz and 7z continue to do great with multiple copies of a large PNG file. They don't have the "small memory size" limitation and are able to notice that the second copy of the file is identical to the first copy, so there's no need to store it a second time.

What you can do about this. Use 7z. It has a bunch of heuristics that help it detect identical or similar files and compress really well in that case. You can also look at lrzip, e.g. with its LZO compression mode.
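For example (archive names are placeholders; -mx=9 selects 7z's strongest preset, and lrzip's -l selects its fast LZO back end):

    7z a -mx=9 folder.7z folder/
    tar cf folder.tar folder/ && lrzip -l folder.tar    # produces folder.tar.lrz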

How do I know? I was able to verify this by trying some experiments with 100 copies of a file containing random bytes. I tried 100 copies of a 4KB file, 100 copies of a 1MB file, and 100 copies of a 16MB file. Here's what I found:

Size of file      Size of compressed archive (with 100 copies)
                  zip  tar.gz  tar.bz2  tar.xz    7z
         4KB    414KB     8KB     10KB     5KB    5KB
         1MB    101MB   101MB    101MB     1MB    2MB
        16MB    1.6GB   1.6GB    1.6GB   1.6GB  401MB

As you can see, zip is horrible no matter how small your file is. 7z and xz are both good if your images aren't too large (but xz will be fragile and dependent on the order in which images get placed in the archive, if you have some duplicates and some non-duplicates mixed together). 7z is pretty darn good, even for large files.
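If you want to reproduce, say, the 1MB row yourself, something along these lines should do it (file names are placeholders, and the exact sizes will differ a little from run to run):

    mkdir copies
    head -c 1M /dev/urandom > copies/file0                          # one "incompressible" 1MB file
    for i in $(seq 1 99); do cp copies/file0 copies/file$i; done    # 99 identical copies
    zip -qr copies.zip copies/
    tar czf copies.tar.gz copies/
    tar cjf copies.tar.bz2 copies/
    tar cJf copies.tar.xz copies/
    7z a copies.7z copies/ > /dev/null
    ls -lh copies.zip copies.tar.gz copies.tar.bz2 copies.tar.xz copies.7z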

References. This is also explained well in a number of posts over at Super User.

D.W.
12

Firstly, note that the PNG image format is basically raw RGB pixels (with some light filtering) pushed through the DEFLATE compression format. Generally speaking, compressed files (PNG, JPEG, MP3, etc.) will see no benefit from being compressed again. So for practical purposes, we can treat your PNG file as incompressible random data for the rest of the experiment.

Second, note that ZIP and gzip formats also use the DEFLATE codec. (This would explain why zipping versus gzipping a single file will produce essentially the same output size.)


Now allow me to comment on each test case individually:

  • tar czf folder.tar.gz folder/

    This creates an (uncompressed) TAR file that concatenates all your identical PNG files (with a tiny amount of metadata and padding added). Then this single file is sent through the gzip compressor to create one compressed output file.

    Unfortunately, the DEFLATE format only supports an LZ77 dictionary window of 32768 bytes. So even though the TAR contains repetitive data, if your PNG file is larger than 32 KiB then the DEFLATE compressor certainly cannot remember data far enough back to take advantage of the fact that identical data recurs.

    On the other hand, if you retry this experiment with, say, a 20 KB PNG file duplicated 10 times, then it is very likely you will get a gzip file only slightly bigger than 20 KB (a rough sketch of this test is given after this list).

  • tar cf folder.tar folder/ && xz --stdout folder.tar > folder.tar.xz

    This creates a TAR file just like before, and then uses the xz format and LZMA/LZMA2 compressor. I couldn't find information about LZMA in this situation, but from 7-Zip for Windows I know it can support big dictionary window sizes (e.g. 64 MiB). So it is possible that you were using suboptimal settings, and that the LZMA codec might have been able to reduce the TAR file to just the size of one PNG file.

  • zip -r folder.zip folder/

    The ZIP format does not support "solid" archives; that is to say, every file is compressed independently. We assumed every file is incompressible. Hence the fact that every file is identical cannot be exploited, and the ZIP file will be as big as the straight concatenation of all the files.
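As a rough sketch of the small-file test mentioned under the first bullet (small.png stands in for a PNG well under 32 KiB):

    mkdir small
    for i in $(seq 10); do cp small.png small/copy$i.png; done
    tar czf small.tar.gz small/
    ls -l small.png small.tar.gz    # the archive should be only slightly larger than one copy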

Nayuki
7

The problem is that (most) compression schemes lack knowledge about the data you have. Even if you decompressed your PNGs to bitmaps and compressed those in the tarball, you would not get (significantly) smaller results.

In the case of many similar images, an appropriate compression scheme would be a video codec.

Using lossless coding you should achieve nearly the perfect compression result you are expecting.

If you want to test it, use something like this:

ffmpeg -i img%03d.png -c:v libx264 -profile:v high444 -crf 0 out.mp4

https://trac.ffmpeg.org/wiki/Create%20a%20video%20slideshow%20from%20images
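To get the individual frames back out later, something like this should work (decoded%03d.png is just an illustrative output pattern):

    ffmpeg -i out.mp4 decoded%03d.png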

Jonas
5

PNG is the combination of Filters+LZ77+Huffman (the combination of LZ77+Huffman is called Deflate) in that order:

step 1) if the filter is different from None, the values of the pixels are replaced by the difference from adjacent pixels (for more details see http://www.libpng.org/pub/png/book/chapter09.html). That improves the compression of images with gradients (so ...4 5 6 7 becomes ...1 1 1 1) and it may help in areas of the same color (...3 3 3 5 5 5 5 5 becomes ...0 0 0 2 0 0 0 0). By default, filters are enabled in 24-bit images and disabled in 8-bit images with a palette.

step 2) the data is compressed with LZ77, which replaces repeated strings of bytes (matches) with a tuple containing the distance to the match and the length of the match.

step 3) the result of step 2 is encoded with a Huffman code, which replaces fixed-length symbols with variable-length codes; the more frequent the symbol, the shorter the code.

There are multiple issues:

A small change that affects only a few pixels will change the results of all 3 steps of PNG compression:

1) The filtered values of the adjacent pixels will change (depending on the filter used). That amplifies the effect of small changes.

2) The change will mean that matches into that area are different. For example, changing 333333 to 333533 means that another occurrence of 333333 will no longer match it, so the compressor will either select a different match for 333333 (at a different distance) or select the same match with a shorter length plus another match for the last 3 bytes. By itself that changes the output a lot.

3) The largest issue is in step 3. The Huffman code uses a variable number of bits, so even a small change means that everything that follows is no longer aligned the same way. AFAIK most compression algorithms can't detect matches that are not byte-aligned, so that will prevent (or at least greatly reduce) compression of the already-compressed data that follows the change.

The other issues are already covered by other replies:

4) Gzip uses the same Deflate algorithm with a 32KB dictionary, so if the PNG files are larger than 32KB the matches will not be detected even if the files are identical. Bzip2 is better in that respect, as it uses a 900 KB block. XZ uses LZMA, which has an 8 MB dictionary at the default compression level.

5) The Zip format doesn't use solid compression, so it will not compress similar or identical files any better.

Perhaps compressors from the PAQ or PPMd family would compress better, but if you need to compress lots of similar image files then you can consider 3 approaches:

1) Store the images uncompressed (with PNG compression level 0, or in a format without compression) and compress them with a compressor with a large dictionary or block size (LZMA will work well).

2) Another option would be to keep the filters but remove the Deflate compression from the PNGs. That can be done, for example, with the AdvDef utility. Then you compress the resulting (internally uncompressed) PNGs. After extraction you can keep the uncompressed PNGs or compress them again with AdvDef (but that will take time).

You need to test both approaches to see which compresses the most (a rough sketch of such a test is given after the third approach below).

3) The last option would be converting the PNG images into a video, compressing it with a lossless video codec like x264 in lossless mode (taking special care to use the right color format), and then, on extraction, converting the frames back to individual PNG images. That can be done with ffmpeg. You would also need to keep the mapping between frame numbers and the original file names.

That would be the most complex approach, but if the PNGs are all part of an animation it may be the most effective. However, you will need a video format that supports transparency if your images use it.
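A rough sketch of how approach 2 could be tested, assuming AdvanceCOMP's advdef with -z -0 (recompress the internal Deflate streams as stored, i.e. uncompressed, blocks) and xz as the outer compressor; approach 1 can be tested the same way with PNGs written at compression level 0:

    cp -r folder folder_stored
    advdef -z -0 folder_stored/*.png                        # keep PNG structure, store image data uncompressed
    tar cf - folder_stored | xz -9 > folder_stored.tar.xz
    ls -lh folder_stored.tar.xz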

Edit: There is also the MNG format, but it is not used often.

ggf31416
2

The PNG file format already uses the DEFLATE compression algorithm internally. This is the same algorithm as used by gzip and zip (xz uses the related, but stronger, LZMA2 algorithm). tar.gz and tar.xz take advantage of similarity between files, because the whole tarball is compressed as one stream; zip does not.

So, in fact, you perform DEFLATE compression over DEFLATE-compressed files -- this is why the files keep almost their original size.

The bzip2 program (which uses a different, block-based algorithm with a 900 KB block size) does better when it comes to (nearly) identical small files: in the example below, five identical ~43 KB PNGs (about 219 KB in total) compress to a 68 KB archive, i.e. not much more than a single copy.

# for i in $(seq 4); do cp test.png test$i.png; done
# tar -cjf archive.tar.bz2 *.png
# ls -l
-rw-r--r-- 1 abcde users  43813 15. Jul 08:45 test.png
-rw-r--r-- 1 abcde users  43813 15. Jul 08:45 test1.png
-rw-r--r-- 1 abcde users  43813 15. Jul 08:46 test2.png
-rw-r--r-- 1 abcde users  43813 15. Jul 08:46 test3.png
-rw-r--r-- 1 abcde users  43813 15. Jul 08:46 test4.png
-rw-r--r-- 1 abcde users  68115 15. Jul 08:47 archive.tar.bz2
rexkogitans
2

When you have special datasets you use special algorithms, not multipurpose tools.

The answer is that the lossless compressors you chose aren't made for what you are doing. No one expects you to compress the same image twice, and even if you do (by accident), checking against all previous input would make your algorithm O(n^2) (maybe a bit better, but the naive approach at least would be n^2).

Most of the compression programs you tested run in O(n); they emphasize speed over optimal compression ratio. No one wants to run their computer for 5 hours just to save a few MBs, especially these days. For larger inputs, anything above O(n) becomes a runtime problem.

Another issue is RAM. You can't access every part of your input at any point in time once the input gets big enough. And even disregarding this, most people don't want to give up their whole RAM or CPU just to compress something.

If you have patterns in your files that you want to exploit, you will have to operate on them manually, write your own compression, or potentially use an "archive"-type compression (nano): a compression for long-term storage that is too slow for everyday use.

Another potential option would be lossless video compression.