
Given two strings $S_1, S_2$, we write $S_1S_2$ for their concatenation. Given a string $S$ and an integer $k\geq 1$, we write $(S)^k = SS\cdots S$ for the concatenation of $k$ copies of $S$. Given a string, we can use this notation to 'compress' it, e.g. $AABAAB$ may be written as $((A)^2 B)^2$. Let's call the weight of a compression the number of characters appearing in it, so the weight of $((A)^2 B)^2$ is two, and the weight of $(AB)^2 A$ (a compression of $ABABA$) is three (separate $A$s are counted separately).

Now consider the problem of computing the 'lightest' compression of a given string $S$ with $|S|=n$. After some thought there is a fairly obvious dynamic programming approach, which runs in $O(n^3 \log n)$ or $O(n^3)$ time depending on the exact implementation.
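
For concreteness, here is a minimal sketch of one such cubic DP (my own illustration, not necessarily the intended contest solution): dp[i][j] is the lightest weight of a compression of S[i..j], obtained either by splitting the interval, or, when the length is a multiple of some period p, by writing the whole block as a power of its first p characters.

def lightest_weight(S: str) -> int:
    """Weight of the lightest compression of S (naive DP sketch)."""
    n = len(S)
    INF = float('inf')
    # dp[i][j] = lightest weight of a compression of S[i..j] (0-based, inclusive)
    dp = [[INF] * n for _ in range(n)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            if length == 1:
                dp[i][j] = 1
                continue
            # option 1: split into two compressed parts
            best = min(dp[i][k] + dp[k + 1][j] for k in range(i, j))
            # option 2: write S[i..j] as (X)^(length/p) for a period p dividing
            # the length; brackets and exponents are free, so this costs dp of X
            for p in range(1, length):
                if length % p == 0 and S[i:j + 1 - p] == S[i + p:j + 1]:
                    best = min(best, dp[i][i + p - 1])
            dp[i][j] = best
    return dp[0][n - 1]

assert lightest_weight("AABAAB") == 2   # ((A)^2 B)^2
assert lightest_weight("ABABA") == 3    # (AB)^2 A

With naive substring comparisons (restricted to divisors of the length) this is roughly the $O(n^3\log n)$ variant; replacing the comparisons by precomputed constant-time longest-common-extension queries brings it down to $O(n^3)$.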

However, I have been told this problem can be solved in $O(n^2 \log n)$ time, though I cannot find any sources on how to do this. Specifically, this problem was given in a recent programming contest (problem K here, last two pages). During the analysis an $O(n^3 \log n)$ algorithm was presented, and at the end the near-quadratic bound was mentioned (here, at the four-minute mark). Sadly the presenter only referred to 'a complicated word combinatorics lemma', so now I have come here to ask for the solution :-)

Timon Knigge

2 Answers


If I'm not misunderstanding you, I think the minimum cost factorization can be calculated in $O(n^2)$ time as follows.

For each index i, we will calculate a bunch of values $(p_i^\ell, r_i^\ell)$ for $\ell=1,2,\ldots$ as follows. Let $p_i^1\ge 1$ be the smallest integer such that there is an integer $r\ge 2$ satisfying $$S[i-rp_i^1+1, i-p_i^1] = S[i-(r-1)p_i^1+1, i].$$ For this particular $p_i^1$, let $r_i^1$ be the largest $r$ with this property. If no such $p_i^1$ exists, set $L_i=0$ so we know there are zero $(p_i^\ell,r_i^\ell)$ values for this index.

Let $p_i^2$ be the smallest integer strictly bigger than $(r_i^1-1)p_i^1$ such that there is, likewise, an integer $r\ge 2$ satisfying $$S[i-rp_i^2+1, i-p_i^2] = S[i-(r-1)p_i^2+1, i].$$ As before, having fixed $p_i^2$, take $r_i^2$ to be the largest such $r$. In general, $p_i^\ell$ is the smallest such number strictly bigger than $(r_i^{\ell-1}-1)p_i^{\ell-1}$. If no such $p_i^\ell$ exists, then $L_i=\ell-1$.

Note that for each index i, we have $L_i=O(\log (i+1))$, because the $p_i^\ell$ values increase geometrically with $\ell$: if $p_i^{\ell+1}$ exists, it is not just strictly bigger than $(r_i^\ell-1)p_i^\ell$ but bigger than that by at least $p_i^\ell/2$, which establishes the geometric increase.
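
To make the definition concrete, here is a naive enumeration of these pairs in Python (0-indexed end positions; the name runs_ending_at is my own, and this is only an illustration of the definition, not the efficient suffix-tree computation mentioned further down):

def runs_ending_at(S: str):
    """For every end position i (0-based), list the pairs (p, r): p is a
    candidate period and r >= 2 the maximal number of back-to-back copies
    of S[i-p+1 .. i] ending at i, enumerated exactly as defined above.
    Naive and slow; for illustration only."""
    n = len(S)
    result = [[] for _ in range(n)]
    for i in range(n):
        lower = 1                                      # next period must be >= lower
        while True:
            found = None
            for p in range(lower, (i + 1) // 2 + 1):   # two copies must fit
                r = 1
                # grow r while one more copy of the last p characters fits and matches
                while (r + 1) * p <= i + 1 and \
                        S[i - (r + 1) * p + 1:i - r * p + 1] == S[i - p + 1:i + 1]:
                    r += 1
                if r >= 2:
                    found = (p, r)                     # smallest valid period wins
                    break
            if found is None:
                break                                  # all L_i pairs have been produced
            p, r = found
            result[i].append((p, r))
            lower = (r - 1) * p + 1                    # next p must exceed (r-1)*p
    return result

For example, runs_ending_at("AABAAB") gives [(3, 2)] at the last position: the whole string is two copies of AAB, and no shorter period ends there.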

Suppose now all $(p_i^\ell,r_i^\ell)$ values are given to us. The minimum cost is given by the recurrence $$\mathrm{dp}(i,j) = \min\left\{\mathrm{dp}(i, j-1) + 1, \min_\ell \left(\mathrm{dp}\left(i,j - r_j^\ell p_j^\ell\right) + \mathrm{dp}\left(j-r_j^\ell p_j^\ell+1,j-p_j^\ell\right)\right)\right\}$$ with the understanding that $\mathrm{dp}(i,i-1)=0$ (the empty string) and $\mathrm{dp}(i,j) = +\infty$ whenever $j<i-1$. The table can be filled in $O(n^2 + n\sum_j L_j)$ time.
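
For concreteness, here is a direct transcription of this recurrence in Python (my own sketch, not a verification of the approach). It uses 0-based, half-open intervals, takes the empty substring to cost 0 as above, and assumes the pairs are available, e.g. from the hypothetical runs_ending_at helper sketched earlier:

def min_cost(S: str) -> int:
    """Fill the dp table of the recurrence above.  D[a][b] is the cost of
    S[a:b] (0-based, half-open), so D[a][a] = 0 is the empty substring."""
    n = len(S)
    runs = runs_ending_at(S)                 # the (p, r) pairs per end position
    INF = float('inf')
    D = [[INF] * (n + 1) for _ in range(n + 1)]
    for a in range(n + 1):
        D[a][a] = 0
    for b in range(1, n + 1):                # exclusive right end of the interval
        e = b - 1                            # 0-based position of the last letter
        for a in range(b):
            best = D[a][b - 1] + 1           # dp(i, j-1) + 1: spell out the last letter
            for p, r in runs[e]:             # the inner min over the runs ending at e
                start = e - r * p + 1        # where the maximal run begins
                if start >= a:               # run must fit inside S[a:b], else that term is +infinity
                    best = min(best, D[a][start] + D[start][e - p + 1])
            D[a][b] = best
    return D[0][n]

# the two examples from the question
assert min_cost("AABAAB") == 2
assert min_cost("ABABA") == 3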

We already observed above that $\sum_j L_j = O\left(\sum_j \log (j+1)\right) = O(n\log n)$ by bounding the sum term by term. But by looking at the whole sum we can prove something sharper.

Consider the suffix tree $T(\overleftarrow{S})$ of the reverse of $S$ (i.e., the prefix tree of $S$). We will charge each contribution to the sum $\sum_i L_i$ to an edge of $T(\overleftarrow{S})$ in such a way that each edge is charged at most once. Charge each $p_i^\ell$ to the edge emanating from $\mathrm{nca}(v(i), v(i-p_i^\ell))$ and going towards $v(i-p_i^\ell)$. Here $v(i)$ is the leaf of the prefix tree corresponding to $S[1..i]$ and $\mathrm{nca}$ denotes the nearest common ancestor.

This shows that $\sum_i L_i=O(n)$. The values $(p_i^\ell,r_i^\ell)$ can be calculated in $O(n+\sum_i L_i)$ time by a traversal of the suffix tree, but I will leave the details to a later edit if anyone is interested.

Let me know if this makes sense.

MERTON

Take your initial string S of length n. Here is pseudo-code for the method.

next_end_bracket = n
for i in [0:n]: # main loop

    break if i >= length(S) # S shrinks as it gets compressed
    w = next_end_bracket - i # width of the window to analyse

    for j in [w/2:0:-1]: # period loop, look for the largest period first
        for r in [1:n]: # repetition-count loop
            if j*(r+1) > w:
                break r loop # the (r+1)-th copy does not fit in the window

            for k in [0:j]:
                # compare term by term and break at the first difference
                if S[i+k] != S[i+r*j+k]:
                    break r loop

        if r > 1:
            # compress: the window starts with r copies of S[i:i+j]
            replace S[i:i+j*r] with ( S[i:i+j] )^r
            # don't forget to record the end bracket...
            # ...and reduce w for this i-run, carrying on the j-loop for possible smaller periods.
            w = j

I intentionally gave few details on the "end brackets", as they require a lot of stacking and unstacking steps that would make the core method unclear. The idea is to test for a possible further contraction inside the first one, for example ABCBCABCBC => (ABCBC)² => (A(BC)²)².
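
To illustrate that nesting, here is a small runnable paraphrase of the greedy idea (my own toy with a made-up name, greedy_compress; it builds the compressed string recursively instead of doing the in-place replacement and bracket bookkeeping of the pseudo-code above):

def greedy_compress(s: str) -> str:
    """Scan left to right; at each position try the largest period first,
    take the maximal repetition, and recurse into the repeated block to
    catch nested contractions."""
    out = []
    i, n = 0, len(s)
    while i < n:
        w = n - i                                  # width still to analyse
        compressed = False
        for j in range(w // 2, 0, -1):             # largest period first
            r = 1
            while i + (r + 1) * j <= n and s[i:i + j] == s[i + r * j:i + (r + 1) * j]:
                r += 1
            if r > 1:
                out.append("(" + greedy_compress(s[i:i + j]) + ")^" + str(r))
                i += r * j
                compressed = True
                break
        if not compressed:                         # no repetition here, copy one letter
            out.append(s[i])
            i += 1
    return "".join(out)

print(greedy_compress("ABCBCABCBC"))               # prints (A(BC)^2)^2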

So the main point is to look for large periods first. Note that S[i] is the ith term of S skipping any "(", ")" or power.

  • i-loop is O(n)
  • j-loop is O(n)
  • r- and k-loops together are O(log(n)), as they stop at the first difference

Overall, this is O(n²log(n)).

Optidad