
I'm looking for pointers to algorithms that find long, connected, similar subsequences that two given sequences have in common. For example, in the case of two strings:

abcaabbaabUVWXYZ
UVWXeYZababababab

I'm interested in:

**********UVWXYZ
UVWXeYZ**********

Not in:

ab*aabba*b******
*******aba*ab*bab

(which would be one possible longest common subsequence for the given strings).

For the example above, e represents a (small) difference in the otherwise identical strings UVWXYZ and UVWXeYZ. This is where the similarity comes in. e is not necessarily the addition of a single character; it may just as well be a change. For longer strings, multiple characters (even in direct succession) may differ.

The algorithm should probably be driven by a rating function for the length and the similarity of subsequences.
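
For instance, one very simple rating function of that kind might look like the sketch below (the penalty weight is my own assumption, not something implied by the problem):

    # Hypothetical rating function: reward the length of a matched region and
    # penalize the differences found inside it.  The weight is an assumption.
    def rate(match_length, differences, penalty_per_difference=2.0):
        return match_length - penalty_per_difference * differences

    # The UVWXYZ / UVWXeYZ region (length 6, one difference) then rates far
    # higher than any short exact "ab" match.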

I'm aware that this problem is rather vague, so any pointers to possibly related problem domains and corresponding algorithms are appreciated as well.

Update: Removed exclusion criterion "LCS", because it actually seems to be what I'm looking for.

mstrap

2 Answers


Let's call this problem LCCS (longest common connected subsequence); it is different from LCS. One straightforward way is a brute-force algorithm with worst-case time complexity $O(nm \cdot \min(n, m))$: each of the $nm$ starting pairs is extended by at most $\min(n, m)$ characters.

    Input:  A[1..n], B[1..m]
    Output: length of the longest common connected subsequence

    max = 0
    for i from 1 to n
        for j from 1 to m
            if lowercase(A[i]) = lowercase(B[j])       // case-insensitive match of a starting pair
                i' = i
                j' = j
                len = 1
                while i' < n and j' < m                // stop before running past either string
                    i'++
                    j'++
                    if lowercase(A[i']) != lowercase(B[j']) break
                    else len++
                if len > max then max = len
    return max
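
For reference, a direct Python transcription of this brute force (just a sketch; it keeps the case-insensitive comparison from the pseudocode, and the function name is mine):

    def longest_common_connected_subsequence(a, b):
        """Brute-force LCCS: length of the longest contiguous block of
        characters the two strings share (compared case-insensitively,
        as in the pseudocode above)."""
        a, b = a.lower(), b.lower()
        n, m = len(a), len(b)
        best = 0
        for i in range(n):
            for j in range(m):
                length = 0
                # extend the candidate match while both strings agree
                while i + length < n and j + length < m and a[i + length] == b[j + length]:
                    length += 1
                best = max(best, length)
        return best

    print(longest_common_connected_subsequence("abcaabbaabUVWXYZ", "UVWXeYZababababab"))  # 4 ("UVWX")

The standard dynamic-programming formulation of the longest common substring problem brings this down to $O(nm)$ if the brute force turns out to be too slow.
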
orezvani

You might be interested in the linear-time "LCS with $k$ mismatches" variant (longest common substring allowing up to $k$ mismatching positions), with gaps weighted by some penalty. Even though the core algorithm works with fixed gaps, it is easily converted to your problem by simply running it for several gap lengths (the naive approach). A better option would be to replace the function calculating the distance with a function over your penalties.
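
This is not the linear-time algorithm itself, but a simple quadratic sketch of the underlying idea (longest pair of aligned regions with at most $k$ mismatching positions, substitutions only), assuming $k$ is the single tunable parameter; the function name is hypothetical:

    def longest_common_substring_k_mismatches(a, b, k):
        """Longest equal-length regions a[i:i+L] and b[j:j+L] differing in at
        most k positions.  Slides a window along every diagonal; O(n*m) time,
        much simpler (but slower) than the linear-time algorithms referred to
        above.  Returns (length, start_in_a, start_in_b)."""
        n, m = len(a), len(b)
        best = (0, 0, 0)
        for d in range(-(n - 1), m):          # a diagonal is fixed by d = j - i
            i0, j0 = max(0, -d), max(0, d)    # first cell on this diagonal
            diag_len = min(n - i0, m - j0)
            mismatches = []                   # diagonal positions of mismatches inside the window
            start = 0                         # left edge of the sliding window
            for end in range(diag_len):
                if a[i0 + end] != b[j0 + end]:
                    mismatches.append(end)
                if len(mismatches) > k:       # budget exceeded: drop the oldest mismatch
                    start = mismatches.pop(0) + 1
                if end - start + 1 > best[0]:
                    best = (end - start + 1, i0 + start, j0 + start)
        return best

    # With one mismatch allowed, the long region around UVWXYZ / UVWXeYZ wins:
    # this prints (5, 10, 0), i.e. "UVWXY" aligned against "UVWXe".  A pure
    # substitution model cannot absorb the inserted "e", but this still beats
    # every exact "ab" match by far.
    print(longest_common_substring_k_mismatches("abcaabbaabUVWXYZ", "UVWXeYZababababab", 1))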

Another approach would be to take a suffix tree modified to ignore gaps of length at most $k$ and then calculate penalties on the results (starting from the longest one and then pruning results based on the best match found so far, including gap penalties). An optimized tree or DAG will decrease the memory footprint.

If you are interested in circular matches, simply concatenate one string with itself, as in the sketch below.
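
A minimal sketch of that doubling trick, reusing the hypothetical `longest_common_substring_k_mismatches` from above (any other matcher would work in its place):

    def circular_best_match(a, b, k):
        """Match a against b treated as a circular string by searching in b
        doubled; a match can never meaningfully exceed len(b), so cap it."""
        length, i, j = longest_common_substring_k_mismatches(a, b + b, k)
        return min(length, len(b)), i, j % len(b)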

Evil