
I am writing a program that computes and displays diffs. I implemented Myers' algorithm, which computes the LCS of two sequences (seq1 and seq2); its output is one of the possible LCSs, together with a partition of seq1 and of seq2, one projection of which is the LCS.

I want to improve it so that the LCS displayed minimizes the number of breaks; to do so, I implemented a function f(lcs, seq):

  • seq is a sequence of characters
  • lcs is a subsequence of seq
  • the output is a partition of seq p0, p1, p2, ..., pn such that
    • either p0 + p2 + ... or p1 + p3 + ... is lcs
    • and n is minimal

I did so using some sort of BFS: at each step, I find the next element of seq not yet covered by lcs and greedily expand the common part.
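
To make the specification of f concrete, here is a small checker for candidate partitions (an illustrative sketch, not from the linked code; `is_valid_partition` is a name I made up):

```rust
// Illustrative checker for the specification of f (not from the linked
// code): a candidate partition p0, p1, ..., pn of seq is valid when the
// pieces concatenate back to seq and either the even-indexed or the
// odd-indexed pieces concatenate to lcs.
fn is_valid_partition(seq: &str, lcs: &str, parts: &[&str]) -> bool {
    if parts.concat() != seq {
        return false;
    }
    let even: String = parts.iter().step_by(2).copied().collect();
    let odd: String = parts.iter().skip(1).step_by(2).copied().collect();
    even == lcs || odd == lcs
}

fn main() {
    // For seq = "xabyab" and lcs = "ab", both partitions below are valid,
    // but the second one has fewer pieces, hence fewer breaks.
    assert!(is_valid_partition("xabyab", "ab", &["x", "ab", "yab"]));
    assert!(is_valid_partition("xabyab", "ab", &["xaby", "ab"]));
}
```

This also illustrates why a leftmost-match strategy is not enough: matching the first "ab" in "xabyab" yields three pieces, while matching the trailing "ab" yields two.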

The resulting algorithm is quite slow: on a typical input, it is ~3x slower than the Myers algorithm, which seems to compute something far more complex. See the code here: https://github.com/mookid/diffr/blob/c9ed7746193fd9833ddce1237d6e5005e91deaf4/diffr-lib/src/best_projection.rs

Am I missing a better algorithm?

mookid

1 Answer


Your application may be different from mine, but I think what you want is to take two strings $a$ and $b$ as input and output a diff $d$ such that no diff is shorter than $d$ and no shortest diff has fewer breaks than $d$. By a diff I mean the algebraic data type List of ((CommonToBoth | FromLeftString | FromRightString) * Char), best expressed in Rust as a sequence of a three-variant enum, each variant storing a character (or a T, if generic). Minimizing breaks amounts roughly to minimizing the number of runs of CommonToBoth. [Concretely, I call mine Common, Insert and Delete.]
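
For concreteness, here is one way that data type might look in Rust, together with a run counter (a sketch; the names `DiffItem` and `count_common_runs` are mine, not from any particular library):

```rust
// A sketch of the diff data type described above (names are mine):
#[derive(Debug, PartialEq)]
enum DiffItem<T> {
    Common(T), // CommonToBoth
    Delete(T), // FromLeftString
    Insert(T), // FromRightString
}

// Breaks roughly correspond to maximal runs of Common items.
fn count_common_runs<T>(diff: &[DiffItem<T>]) -> usize {
    let mut runs = 0;
    let mut in_run = false;
    for item in diff {
        match item {
            DiffItem::Common(_) => {
                if !in_run {
                    runs += 1;
                    in_run = true;
                }
            }
            _ => in_run = false,
        }
    }
    runs
}

fn main() {
    use DiffItem::*;
    // Diffing "axb" against "ab": two runs of Common, i.e. one extra break.
    let d = vec![Common('a'), Delete('x'), Common('b')];
    assert_eq!(count_common_runs(&d), 2);
}
```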

I'm not sure that computing a break-minimizing matching $m_a$ between $d$ and $a$, and a break-minimizing matching $m_b$ between $d$ and $b$, will let you compute a break-minimizing matching $m_{ab} = g(m_a, m_b)$ between $d$ and both $a$ and $b$ simultaneously, which is what I think you really want.

I've managed to compute a shortest diff with the smallest number of breaks, using an adaptation of Wagner-Fischer (dynamic programming, $O(n^2)$ space for a big table).

Normally you store the LCS length in the Wagner-Fischer table. Instead, you can choose a score of your own and store that. I used triples similar to (lcsLength, numberOfBreaks, isCurrentRunSharedOrEdits). There is a trick: when the characters don't match you need to do something more complicated than taking the max, and when they do match it is not always optimal to include the particular match you're looking at. Table entry (i, j) is the best of adjustScoreInMatchingCase(table[i-1, j-1]) and combineMismatchingScore(table[i-1, j], table[i, j-1]), where the two functions depend on the particulars of how you score things.
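
One concrete way to realize this (a sketch of mine, not the exact code behind the description above): instead of storing the isCurrentRunSharedOrEdits flag inside one triple, keep two (lcsLength, breaks) scores per cell, split on whether the alignment ends in a shared run or in edits. That split makes adjustScoreInMatchingCase and combineMismatchingScore explicit, and makes scores comparable by "longer LCS first, then fewer breaks":

```rust
type Score = (usize, usize); // (lcs_len, breaks): longer wins, then fewer breaks

fn better(a: Option<Score>, b: Option<Score>) -> Option<Score> {
    match (a, b) {
        (Some((l1, b1)), Some((l2, b2))) => {
            if l1 > l2 || (l1 == l2 && b1 <= b2) { a } else { b }
        }
        (Some(_), None) => a,
        _ => b,
    }
}

/// LCS length of `a` and `b`, with the minimum number of Common runs
/// ("breaks") over all maximum-length common subsequences.
fn lcs_len_min_breaks(a: &[u8], b: &[u8]) -> Score {
    let (n, m) = (a.len(), b.len());
    // common[i][j]: best score for a[..i], b[..j] when the alignment ends
    // with a Common item; edit[i][j]: when it ends with an Insert/Delete
    // (or is empty). Splitting on the last item keeps the DP exact.
    let mut common = vec![vec![None; m + 1]; n + 1];
    let mut edit = vec![vec![None; m + 1]; n + 1];
    edit[0][0] = Some((0, 0));
    for i in 0..=n {
        for j in 0..=m {
            if i > 0 {
                let from_above = better(common[i - 1][j], edit[i - 1][j]);
                edit[i][j] = better(edit[i][j], from_above);
            }
            if j > 0 {
                let from_left = better(common[i][j - 1], edit[i][j - 1]);
                edit[i][j] = better(edit[i][j], from_left);
            }
            if i > 0 && j > 0 && a[i - 1] == b[j - 1] {
                // Extending an existing run is free; starting one costs a break.
                let cont = common[i - 1][j - 1].map(|(l, br)| (l + 1, br));
                let start = edit[i - 1][j - 1].map(|(l, br)| (l + 1, br + 1));
                common[i][j] = better(cont, start);
            }
        }
    }
    better(common[n][m], edit[n][m]).unwrap()
}

fn main() {
    // "ab" occurs contiguously in both strings: LCS length 2, one run.
    assert_eq!(lcs_len_min_breaks(b"xabyab", b"zabw"), (2, 1));
    // The '_' forces the two matched characters into separate runs.
    assert_eq!(lcs_len_min_breaks(b"a_b", b"ab"), (2, 2));
}
```

Recovering the diff itself then takes the usual backtracking over the table(s), which I have omitted here.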

This works, but takes $O(n^2)$ time and space. I can't do much about the asymptotic time (provably so if the strong exponential time hypothesis holds, if I'm not mistaken). I think you can improve the space requirement by storing only the previous and current rows of the table; the Wagner-Fischer recurrence never refers back more than one row.
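
The two-row trick looks like this on the plain LCS-length recurrence (a sketch with illustrative names; the break-scoring version above rolls its rows the same way):

```rust
// Two-row Wagner-Fischer: O(m) space instead of O(n*m).
// Same recurrence as the full table; only `prev` and `cur` are kept.
fn lcs_len_two_rows(a: &[u8], b: &[u8]) -> usize {
    let m = b.len();
    let mut prev = vec![0usize; m + 1];
    let mut cur = vec![0usize; m + 1];
    for i in 1..=a.len() {
        for j in 1..=m {
            cur[j] = if a[i - 1] == b[j - 1] {
                prev[j - 1] + 1
            } else {
                prev[j].max(cur[j - 1])
            };
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[m] // after the final swap, `prev` holds the last computed row
}

fn main() {
    // Example from Myers' paper: LCS("abcabba", "cbabac") has length 4.
    assert_eq!(lcs_len_two_rows(b"abcabba", b"cbabac"), 4);
}
```

Note that dropping the full table this way gives up easy backtracking; recovering the diff itself needs either the full table or a divide-and-conquer scheme in the style of Hirschberg's algorithm.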

Incidentally, I don't think the logic in Myers' algorithm that cleverly constructs and (re)uses the v array works in this case: the fact that when a[i] == b[j] you don't always extend from LCS(a[..i-1], b[..j-1]) is exactly why the v array can't be extended trivially. But I might be wrong; I haven't fully explored the problem.

Postscript: unless smake is a new build tool I haven't heard of, you probably want to rename max_sMake_len ($m \rightarrow n$). If you like property testing, you may want to steal my tests at https://github.com/jonaskoelker/equate/blob/master/test/EquateProperties.scala; but beware, one or two of them have slightly wrong labels.

Jonas Kölker