I'm exploring algorithms to compress small strings like the following. Every line is compressed individually, i.e. even strings this short should get compressed on their own:
ZYXH1|104932,104932,927.6000,200,1,927.4000,1600,1
ZYXH1|104932,104932,3390,1600,1,3389,2700,1
HXDFF|0001739.42;HXDFF:Y|0U010:49:32
ZYXH1|104932,104932,5360,5400,1,5350,16400,1
ZYXH1|104931,104931,1966,2800,1,1965,3800,1;HXYFBR|0003.12;HXCFBR|0001.14;HXCFER|0006.7;HXMKCF|0003268409320112.0
ZYXH1|104932,104932,526,192400,1,525,88400,1
ZYXH1|104932,104932,803,6300,1,802,63900,1
ZYXDFF|803;ZYXDFF:Y|104932;ZYXVW|100800.6381;ZYXVL|1121700,104932;TAXTIC|L719;TAXNO|L1121700
ZYXH1|104932,104932,1367,2100,1,1366.5000,2800,1
HXYFBR|0002.04;HXCFBR|0001.52;HXCFER|00015.0;HXMKCF|0001676300748887.4
ZYXH1|104932,104932,380,63500,1,379,21500,1
ZYXH1|104932,104932,5360,5400,1,5350,16400,1
ZYXH1|104932,104932,803,3900,1,802,63800,1;ZYXVA|898075800
ZYXH1|104932,104932,3045,4700,1,3040,5500,1;ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.4356;ZYXVL|164600,104932;ZYXVA|498213500;TAXTIC|L199;TAXTICD|L42;TAXNO|L164600;TAXDFG|Y0058
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.3304;ZYXVL|169800,104932;TAXTIC|L202;TAXNO|L169800
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2935;ZYXVL|171700,104932;TAXTIC|L206;TAXNO|L171700
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2764;ZYXVL|172600,104932;HXCFF|0003040;HXCFF:T|0T010:49;TAXTIC|L212;TAXNO|L172600
ZYXDFF|3040;ZYXDFF:Y|104932;ZYXVW|1003043.2745;ZYXVL|172700,104932;TAXTIC|L213;TAXNO|L172700
ZYXH1|104932,104932,4168,1200,1,4167,100,1
I have a couple of megabytes' worth of data like the above, which allows me to build a static statistical model, e.g. for a Huffman coder. With such a static model I can embed it in the decompressor, removing the need to transmit/store it with each string, i.e. trading adaptability for compression rate (important with these small strings).
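For concreteness, deriving the static model boils down to counting byte frequencies over the sample corpus once and compiling the resulting table into both encoder and decoder. A minimal sketch, not my actual code (the corpus file name and the emitted array name are made up):

#include <array>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iostream>

// Count per-byte frequencies over a sample corpus and print them as a C
// array, so the table can be compiled into encoder and decoder instead of
// being transmitted alongside every compressed string.
int main() {
    std::ifstream in("sample_corpus.txt", std::ios::binary); // made-up file name
    if (!in) { std::cerr << "cannot open corpus\n"; return 1; }

    std::array<std::uint64_t, 256> freq{};
    char c;
    while (in.get(c))
        ++freq[static_cast<unsigned char>(c)];

    std::printf("static const uint64_t kStaticFreq[256] = {\n");
    for (int i = 0; i < 256; ++i)
        std::printf("%llu%s", static_cast<unsigned long long>(freq[i]),
                    i == 255 ? "\n};\n" : (i % 8 == 7 ? ",\n" : ", "));
    return 0;
}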
I have now tried various algorithms; the results of compressing those megabytes' worth of strings (using a simple program that transmits the compressed data via TCP) are as follows (a sketch of how the rate is measured follows the table):
Algorithm        Rate    Bytes/sec
----------------------------------
deflate optim    46.7%     3724917
deflate fast     44.2%     3831672
huffman          41.9%    11169462
lz4 1576 optim   25.6%     4334541
lz4 18576 fast   23.5%    42504338
lz4 1576 fast    23.4%    70996590
uncompressed        0%   105210881
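Rate here means 1 - compressed/uncompressed, summed over all lines, with every line compressed on its own. The rate measurement boils down to something like this (simplified sketch using zlib's deflate and a made-up corpus file name; the real program additionally pushes the compressed buffers through a TCP socket to obtain the Bytes/sec figure):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <zlib.h>

// Compress every input line individually and report the overall rate,
// i.e. 1 - compressed/uncompressed.
int main() {
    std::ifstream in("sample_corpus.txt");          // made-up file name
    std::string line;
    std::uint64_t raw = 0, packed = 0;

    while (std::getline(in, line)) {
        std::vector<Bytef> out(compressBound(line.size()));
        uLongf outLen = out.size();
        if (compress2(out.data(), &outLen,
                      reinterpret_cast<const Bytef*>(line.data()),
                      line.size(), Z_BEST_COMPRESSION) != Z_OK)
            return 1;
        raw += line.size();
        packed += outLen;
    }
    if (raw == 0) return 0;
    std::cout << "rate: " << 100.0 * (1.0 - double(packed) / raw) << "%\n";
    return 0;
}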
So deflate tops the list for compression rate (even for these small strings, where deflate cannot rely on a shared static model), but it's 2 to 3 times slower than my home-grown Huffman coder. Given all the repeating character patterns in the input data, it seems one should be able to do much better than a Huffman coder that only looks at single bytes. LZ4 can be really fast with small buffer sizes (I only listed sizes of 1576 and 18576; Bytes/sec drops significantly for bigger sizes), but its compression rate is not that good.
Are there simple ways to change / extend a per-byte Huffman coder to make use of the fact that there are many repeating (multi-byte) character patterns?
However, it shouldn't grow the statistical model to thousands of entries (currently the model has at most 256 symbols, and merely switching from 1-grams to 2-grams would already blow it up to as many as 32768), because I do in fact want to update and therefore store/transmit the statistical model, just not for every string, only every, say, 1 megabyte of input or so. So it would sit somewhere between a static statistical model and a fully dynamic one.
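To give an idea of the kind of extension I'm imagining (not claiming this is the right approach): a small, periodically re-learned list of frequent multi-byte patterns could be mapped to extra symbols before the Huffman stage, so the alphabet grows to 256 + N instead of 256*256 entries. A rough sketch of just that tokenizing step, with a made-up pattern list:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

// A few frequent multi-byte patterns, mapped to extra symbols 256, 257, ...
// The list would be re-learned and retransmitted every megabyte or so.
static const std::vector<std::string> kPatterns = {   // made-up examples
    "ZYXH1|104932,104932,", "ZYXDFF:Y|", "ZYXDFF|", ";TAXTIC|L", ";TAXNO|L",
};

// Turn a raw string into a symbol stream: 0..255 for plain bytes,
// 256 + i for an occurrence of kPatterns[i].
std::vector<std::uint16_t> tokenize(std::string_view s) {
    std::vector<std::uint16_t> symbols;
    for (std::size_t pos = 0; pos < s.size(); ) {
        bool matched = false;
        for (std::size_t p = 0; p < kPatterns.size(); ++p) {
            const std::string& pat = kPatterns[p];
            if (s.compare(pos, pat.size(), pat) == 0) {
                symbols.push_back(static_cast<std::uint16_t>(256 + p));
                pos += pat.size();
                matched = true;
                break;
            }
        }
        if (!matched)
            symbols.push_back(static_cast<unsigned char>(s[pos++]));
    }
    return symbols;
}

int main() {
    auto syms = tokenize("ZYXH1|104932,104932,927.6000,200,1,927.4000,1600,1");
    std::cout << syms.size() << " symbols instead of 50 bytes\n";
    return 0;
}

The Huffman coder itself would then simply run over this uint16_t symbol stream with an alphabet of 256 + kPatterns.size() entries, and only the pattern list plus the code lengths would need to be refreshed every megabyte or so.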

