
Informal Problem Statement:

Given a string, e.g. $ACCABBAB$, we want to colour some letters red and some letters blue (and some not at all), such that reading only the red letters from left to right yields the same result as reading only the blue letters.

In the example we could colour them like this: $A\color{blue}{C}\color{red}{CAB}B\color{blue}{AB}$

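Therefore, we say $CAB$ is a repeated subsequence of $ACCABBAB$. It is also a longest repeated subsequence: a longer one would have length $4$ and therefore use all eight letters, which is impossible because $A$ occurs an odd number of times.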
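For small strings the claim can be checked by brute force. The sketch below is only an illustration, not an efficient algorithm: it tries all $3^n$ colourings, and the function name `longest_repeated_subsequence` is made up for this example.

```python
from itertools import product


def longest_repeated_subsequence(s):
    """Exhaustive search over all 3^n colourings; only usable for short strings."""
    best = ""
    # 0 = uncoloured, 1 = red, 2 = blue
    for colouring in product((0, 1, 2), repeat=len(s)):
        red = "".join(ch for ch, c in zip(s, colouring) if c == 1)
        blue = "".join(ch for ch, c in zip(s, colouring) if c == 2)
        # A colouring is valid when both colours spell the same word.
        if red == blue and len(red) > len(best):
            best = red
    return best


print(len(longest_repeated_subsequence("ACCABBAB")))  # prints 3
```

Several different words of maximum length may exist, so the sketch reports only the first one it finds.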

Can we compute longest repeated subsequences efficiently?

Formal Question:

Given a string and an integer $k$, is it NP-hard to decide whether the string contains a repeated subsequence of length $k$?

  • If so: Which problem can be reduced to this problem?
  • If not: What is an efficient algorithm? (obviously, this algorithm can then be used to compute a longest repeated subsequence)

Bonus Question:

Will there always be a repeated subsequence of length $n/2 - o(n)$ in a string of length $n$ if the size of the alphabet is bounded by a constant?

(This is known to be true for binary alphabets.)

Edit 2: The negative answer to the Bonus Question is already known for alphabets of size at least $5$. In fact, for alphabets of size $Σ$ there are strings whose longest repeated subsequences have length only $O(n \cdot Σ^{-1/2})$. Random strings suffice to show this. The result already existed, but I overlooked it.

Edit: Note:

Some people mean "substring" when they say "subsequence". I don't. This is not the problem of finding a substring that occurs twice.

xskxzr
Sekti

2 Answers


The special case $k = n/2$ is exactly the problem asked in the CST.SE question "How hard is unshuffling a string?".

Buss and Soltys proved that this special case is NP-complete [1] by a reduction from the 3-Partition problem, so the general problem is NP-hard as well.

  • [1]: Buss, Sam, and Michael Soltys. "Unshuffling a square is NP-hard." Journal of Computer and System Sciences 80.4 (2014): 766-776.
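To make the special case concrete, here is a minimal backtracking sketch that decides whether a string is a shuffle of some word with itself. It is not taken from [1] and runs in exponential time in the worst case, as one would expect given the hardness result. The name `is_square_shuffle` is made up for this example. The sketch relies on the observation that the two copies can always be relabelled so that the $i$-th red letter comes before the $i$-th blue letter.

```python
from functools import lru_cache


def is_square_shuffle(s):
    """Decide whether s is a shuffle of some word w with itself (k = n/2)."""
    if len(s) % 2:
        return False

    @lru_cache(maxsize=None)
    def search(pos, pending):
        # pending: red letters read so far that still await their blue partner
        if len(pending) > len(s) - pos:
            return False  # too few letters left to match all pending ones
        if pos == len(s):
            return True  # pending is empty here because of the check above
        ch = s[pos]
        # Option 1: colour s[pos] blue, matching the oldest pending red letter.
        if pending and pending[0] == ch and search(pos + 1, pending[1:]):
            return True
        # Option 2: colour s[pos] red and let it wait for a later partner.
        return search(pos + 1, pending + ch)

    return search(0, "")


print(is_square_shuffle("AABB"))  # True: both copies spell AB
print(is_square_shuffle("ABBA"))  # False
```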
pcpthm

This can be solved in polynomial time by constructing a graph $G$ in which each node represents a pair of positions $(i,j)$ of the string $S$ with $S[i]=S[j]$, i.e. a single matched letter of some repeated subsequence. An edge from node $u$ to node $v$ means that $u$ can be continued by $v$ to form a repeated subsequence of length 2.

1. Find the nodes. This can be done in $O(n^2)$ time by building a sorted list of indices for each character $c$, and then enumerating the unique pairs. There are no more than $m=n^2$ nodes.

2. Find the edges. It takes $O(1)$ time to check if node $u$ can be continued by node $v$, so by considering all pairs $(u,v)$ this step takes $O(m^2)$ time.

3. Note that the longest path in $G$ may not be a valid repeated subsequence. Consider paths $ab$ and $bc$. If there exists an edge $ac$ then $abc$ is a valid repeated subsequence of length 3. Therefore it takes $O(m^4)$ time to find all repeated subsequences of length 3. In the general case it takes linear time to check whether two valid paths of length $\ell$ can be combined into a valid path of length $\ell+1$.

4. Repeat step 3 until no longer path can be found.
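Steps 1 and 2 could be sketched roughly as follows. The answer does not spell out the edge test, so the condition used here is an assumption: $(i,j)$ can be continued by $(i',j')$ when $i<i'$, $j<j'$ and the four positions are distinct. The function name `build_pair_graph` is made up, and steps 3 and 4 are not covered.

```python
from collections import defaultdict
from itertools import combinations


def build_pair_graph(s):
    """Build the nodes (step 1) and edges (step 2) of the graph described above."""
    # Step 1: sorted index list per character, then all pairs i < j.
    positions = defaultdict(list)
    for idx, ch in enumerate(s):
        positions[ch].append(idx)
    nodes = [pair for idxs in positions.values() for pair in combinations(idxs, 2)]

    # Step 2: O(1) test per ordered pair of nodes (assumed edge condition).
    edges = {u: [] for u in nodes}
    for u in nodes:
        for v in nodes:
            (i, j), (i2, j2) = u, v
            # i < j and i2 < j2 hold by construction, so only i2 == j
            # could make two of the four positions coincide.
            if i < i2 and j < j2 and i2 != j:
                edges[u].append(v)
    return nodes, edges


nodes, edges = build_pair_graph("ABAB")
print(nodes)  # [(0, 2), (1, 3)]
print(edges)  # {(0, 2): [(1, 3)], (1, 3): []}
```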

noplogist