7

I have an application that needs to compute billions of levenshtein distance between pairs of strings. The strings are short (70 in length) DNA sequences, consisting only of 4 characters. Also it can be assumed that one of the strings is fixed, i.e., we are comparing one fixed string to a billion other strings.

I know that the dynamic programming implementation of the levenshtein distance is $\mathcal{O}(m n)$, would like to know if there are any room for improvement. I found these two algorithms:

  • $\mathcal{O}(n + d^2)$ algorithm, in which $d$ is the edit distance by Berghel et al. However I can't make the assumption that $d$ is small so it might not give any advantage
  • $log(n)^{\mathcal{O}(1/\epsilon)}$ approximation in $n^{1+\epsilon}$ time by Andoni et al. But I have two concerns regarding this:
    • Is this algorithm also fast in practice?
    • Does $log(n)^{\mathcal{O}(1/\epsilon)}$ mean the computed edit distance in worst case is $log(n)^{\mathcal{O}(1/\epsilon)}$ times the actual one? In that case it's too much.

Do you know of any other algorithm/idea/approach that might be applicable?

Ameer Jewdaki
  • 539
  • 2
  • 14

2 Answers2

3

One approach is to build a Levenshtein automaton for the fixed string (see, e.g., here). Given a string $x$ and a distance $D$, you can build a DFA that recognizes all strings that are at distance $\le D$ from $x$. Thus, you can test whether a string is close to $x$ in $O(n)$ time, where $n$ is the length of the string. I'm not sure what the space requirements are to store the DFA (they are linear in $m,n$, but might be exponential in $D$).

Alternatively, you could use an "early-out" algorithm for computing the edit distance. You mentioned that you are only interested in the edit distance if it is less than some threshold $D$. There is an "early-out" algorithm for computing the edit distance whose running time is $O(\max(n,m) \times D)$, which computes the edit distance if it is $\le D$ or else outputs "too big" if it is $>D$. Basically, you do the standard dynamic programming algorithm for the edit distance, but only compute the elements of the matrix that are $\le D$ away from the diagonal. In your case this might or might not be better than the other alternatives.

D.W.
  • 167,959
  • 22
  • 232
  • 500
-1

If I had to do billions and it was only 4 characters I would represent the characters as
1000
0100
0010
0001
It is then a 35 byte integer

Dot a bit-wise and and count the 1's

Not perfect but billions is a lot unless you throw a lot of CPU at it.

paparazzo
  • 431
  • 3
  • 13