
I'm looking at a particular case of the longest common substring (LCS) problem. In my case I have two really big strings (tens or hundreds of millions of bytes) and need to find the LCS as well as other long common substrings.

A simplified example:

$S_0$ = ABBCADGGHEEASSSCC

$S_1$ = ABDCADGGHMEASSSAC

The LCS is CADGGH (6 characters); the other long common substrings, with 5 characters each, are CADGG, ADGGH and EASSS.

What is the fastest algorithm to get all common substrings together with their lengths (that is, to list every substring and its length)? And for my case (very large byte strings), what is the fastest LCS algorithm (one that returns only the longest common substrings)?

NOTE: I don't have any space limit right now, but this algorithm may be implemented on a mobile device in the future, where RAM and disk space could be very limited (although I will always have at least as much disk space available as the sum of the two file lengths).

Ivan

1 Answer


I have come across this problem before. I compressed both strings and then applied dynamic programming to the compressed strings.

For compression techniques, I followed the methods described in this paper: http://pdf.aminer.org/000/145/966/data_compression_using_long_common_strings.pdf

The speed was decent, although the approach did consume some extra space.
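For reference, the dynamic programming step can be sketched on plain, uncompressed byte strings as below. This is only a minimal illustration of the textbook recurrence, not the code I actually used; the function name `common_substrings` and the `min_len` parameter are made up for the example. It takes time proportional to the product of the two lengths, so on its own it is only practical for small inputs, which is why the compression step matters for strings of the size in the question.

```python
def common_substrings(a: bytes, b: bytes, min_len: int = 1) -> set:
    """Return every distinct common substring of a and b with length >= min_len.

    Hypothetical helper for illustration. Uses the classic recurrence
    run[i][j] = run[i-1][j-1] + 1 when a[i-1] == b[j-1], else 0,
    keeping only the previous row to save memory.
    """
    prev = [0] * (len(b) + 1)
    found = set()
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Length of the common run ending at a[i-1] and b[j-1].
                cur[j] = prev[j - 1] + 1
                # Every suffix of that run that is long enough is common too.
                for length in range(min_len, cur[j] + 1):
                    found.add(a[i - length:i])
        prev = cur
    return found


s0 = b"ABBCADGGHEEASSSCC"
s1 = b"ABDCADGGHMEASSSAC"
found = common_substrings(s0, s1, min_len=5)
longest = max(found, key=len)
for sub in sorted(found, key=len, reverse=True):
    print(sub.decode(), len(sub))
```

On the example from the question this lists CADGGH with length 6 and CADGG, ADGGH and EASSS with length 5.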

user23183