
I have been experimenting with LZ77 (a naive $O(n^2)$ implementation with an unbounded window). Applying it to the 7th Fibonacci word $abaababaabaab$ yields the correct LZ factorization:

$\qquad a,b,a,aba,baaba,ab$.

My question is about the behavior of LZ77 when it is iterated. My experiments suggest that reapplying LZ77 to its own output yields no patterns beyond those found the first time.

By reapplication I mean the following: in the first pass the unit symbols of the string are 'a' and 'b'; in the second pass the unit symbols are the LZ factors produced by the first pass. I was hoping to see (on various larger texts, such as the Complete Sonnets of Shakespeare) increasing gains, and possibly "multilevel" patterns found by LZ over the sequence of factors from the previous iteration. But none of this occurred. The sequence of factors after the second iteration is exactly the same as after the first.

So where is the bug in my thinking? Is there a simple proof of this given the definition of an LZ factor being the longest prefix from the current position occurring in the concatenation of the preceding LZ factors?
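To make the experiment concrete, here is a minimal sketch of the factorization as defined above: each factor is the longest prefix of the remaining input that occurs in the concatenation of the preceding factors, or a single fresh symbol if there is none. The names `occurs` and `lz_factorize` are hypothetical helpers for illustration, not the code actually used in the experiments, and the search is deliberately naive.

```python
def occurs(needle, haystack):
    """Return True if the tuple `needle` is a contiguous run inside `haystack`."""
    m = len(needle)
    return any(haystack[j:j + m] == needle
               for j in range(len(haystack) - m + 1))


def lz_factorize(seq):
    """Naive LZ factorization of any sequence of hashable symbols.

    Each factor is the longest prefix of seq[i:] that occurs in seq[:i]
    (the already factorized part), or one new symbol if no prefix occurs.
    Factors are returned as tuples of symbols.
    """
    seq = tuple(seq)
    n = len(seq)
    factors = []
    i = 0
    while i < n:
        length = 0
        # Extend the candidate factor while it still occurs in seq[:i].
        while i + length < n and occurs(seq[i:i + length + 1], seq[:i]):
            length += 1
        length = max(length, 1)  # a symbol never seen before is its own factor
        factors.append(seq[i:i + length])
        i += length
    return factors


# First pass: the unit symbols are the characters 'a' and 'b'.
text = "abaababaabaab"
first = ["".join(f) for f in lz_factorize(text)]
print(first)  # ['a', 'b', 'a', 'aba', 'baaba', 'ab']

# Second pass: the unit symbols are the factors of the first pass.
second = lz_factorize(first)
print([len(f) for f in second])  # [1, 1, 1, 1, 1, 1]
```

On this input every second-pass factor consists of a single first-pass factor, so the second pass groups nothing new, which matches the behavior described above.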

Raphael

2 Answers


Lempel and Ziv proved that, under some reasonable assumptions, the limiting compression rate of their algorithm equals the entropy rate of the source. That means that in the limit, the output should look essentially random. Random text cannot be compressed (on average), so if you take a long text and apply Lempel-Ziv twice, you should expect the second pass not to compress the text at all.

Yuval Filmus

You have to be careful about what you mean by "iterative". Did you apply this to a sequence of concatenations of the source string? That would certainly enlarge the dictionary. There might also be interesting trade-offs in pattern size, from the coding perspective of the optimal prefix-free encoding of a pattern versus the optimal prefix-free encodings of its individual symbols, that are discovered by applying the algorithm forwards and backwards rather than iteratively.

Rob