
I have used the Rabin-Karp rolling hash for searching for a pattern $P$ in a text $T$. Now I am allowing $k$ mismatches, but I am not able to come up with a faster implementation.

I tried modifying the RK algorithm by splitting the pattern into a few blocks, but that does not improve the speed.

I'm trying to use locality-sensitive hashing, but I am not sure how I could calculate the hash in a sliding-window manner. Any help would be appreciated. For my case, the length of $P$ is 50–75 and the length of $T$ is 100–150.

Alwyn Mathew

1 Answer


Split into segments

One simple approach is to split $P$ into $k+1$ pieces, say $P = P_0 P_1 \cdots P_k$, as @j_random_hacker suggests. You can make each of the $P_i$ approximately equal in length, though this is not required. Then, for each $i$, search for all instances of $P_i$ in $T$ (with no mismatches), and check whether each such occurrence extends to an instance of $P$ with at most $k$ mismatches. This procedure is guaranteed to find every solution to your original problem: by the pigeonhole principle, an occurrence of $P$ with at most $k$ mismatches must contain at least one of the $k+1$ pieces with no mismatch at all, so that piece will surface the occurrence as a candidate.
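A minimal C++ sketch of this filter-and-verify approach (the function name `kMismatchSearch` is my own, and `std::string::find` stands in for Rabin-Karp or any other tuned exact matcher):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Find all start positions where P occurs in T with at most k mismatches,
// using the "split into k+1 pieces" filter: by pigeonhole, any valid
// occurrence must contain at least one piece that matches exactly.
std::vector<size_t> kMismatchSearch(const std::string& T,
                                    const std::string& P, int k) {
    std::vector<size_t> result;
    if (P.empty() || P.size() > T.size()) return result;

    size_t pieces = static_cast<size_t>(k) + 1;
    size_t pieceLen = P.size() / pieces;           // last piece absorbs the remainder
    std::vector<bool> seen(T.size() - P.size() + 1, false);

    for (size_t i = 0; i < pieces; ++i) {
        size_t start = i * pieceLen;
        size_t len = (i + 1 == pieces) ? P.size() - start : pieceLen;
        std::string piece = P.substr(start, len);

        // Exact search for the piece; swap in Rabin-Karp here if desired.
        for (size_t pos = T.find(piece); pos != std::string::npos;
             pos = T.find(piece, pos + 1)) {
            if (pos < start) continue;             // full match would start before T
            size_t occ = pos - start;              // candidate start of a full match
            if (occ + P.size() > T.size() || seen[occ]) continue;
            seen[occ] = true;

            // Verify the candidate with a direct mismatch count.
            int mism = 0;
            for (size_t j = 0; j < P.size() && mism <= k; ++j)
                mism += (T[occ + j] != P[j]);
            if (mism <= k) result.push_back(occ);
        }
    }
    std::sort(result.begin(), result.end());
    return result;
}
```

For example, searching for `"abc"` in `"xabdxabc"` with $k=1$ reports the occurrence at index 1 (`"abd"`, one mismatch) as well as the exact match at index 5.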

For non-degenerate cases, if every $|P_i|$ is large compared to $\log_{|\Sigma|} |T|$ and $P,T$ are more or less random, we can expect that there will be few "false matches" of a piece, and consequently we can expect that the running time will be approximately $O(k \cdot |T|)$. Thus this should be pretty fast in your specific situation.

This approach allows you to use any fast string-matching algorithm to search for instances of $P_i$ in $T$; e.g., you can use Rabin-Karp or any other existing algorithm -- this is nice because you can re-use existing well-tuned string matching libraries. This approach might be not-too-difficult to implement and fairly efficient in practice.

Levenshtein automaton

You can build an NFA (nondeterministic finite-state automaton) to recognize instances of $P$ with at most $k$ mismatches. In particular, the NFA recognizes all strings $S$ that end with something that matches $P$ with at most $k$ mismatches. The NFA has $(k+1) \times |P|$ states: each state is of the form $(m,i)$, where $m$ counts the number of mismatches so far and $i$ counts the length of the prefix of $P$ that has been matched so far.
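Even before determinizing, this NFA can be simulated directly in $O(|T| \cdot |P|)$ time: for each prefix length $i$, it suffices to track the minimum $m$ over all active states $(m,i)$. A hedged C++ sketch of this simulation (the name `kMismatchNFA` and the array layout are my own illustration, not part of the answer):

```cpp
#include <string>
#include <vector>

// Direct simulation of the (k+1) x |P| mismatch NFA. best[i] holds the
// fewest mismatches over all ways of matching P[0..i) against a suffix
// of the text read so far; best[|P|] <= k signals a full occurrence.
std::vector<size_t> kMismatchNFA(const std::string& T,
                                 const std::string& P, int k) {
    std::vector<size_t> result;
    if (P.empty() || P.size() > T.size()) return result;

    const int DEAD = k + 1;                       // state pruned from the NFA
    std::vector<int> best(P.size() + 1, DEAD);
    best[0] = 0;                                  // a match may start anywhere

    for (size_t j = 0; j < T.size(); ++j) {
        // Update right-to-left so each best[i-1] read is the old value.
        for (size_t i = P.size(); i >= 1; --i) {
            int cand = best[i - 1] + (T[j] != P[i - 1]);
            best[i] = (cand <= k) ? cand : DEAD;
        }
        if (best[P.size()] <= k)
            result.push_back(j + 1 - P.size());   // occurrence ends at j
    }
    return result;
}
```

This is essentially the classic dynamic-programming view of the automaton; it is simple, needs no determinization, and is already fast at the sizes in the question.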

You can convert this NFA to a DFA. Heuristically, I expect the size of the DFA to be pretty small, say $O(k \cdot |P|)$, though I don't know of any proof of this. Next, you can run the DFA along the text $T$, to find the first place where there is a match, which takes $O(|T|)$ time.
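One practical way to get the DFA without paying for a full subset construction up front is to build it lazily: a DFA state is the vector of per-prefix mismatch counts (capped at $k+1$), and transitions are computed on demand and memoized. A sketch under that assumption (all names here, such as `MismatchDFA`, are hypothetical):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Lazily determinized mismatch automaton. A DFA state is the vector of
// per-prefix mismatch minima (capped at k+1); transitions are computed
// on first use and cached, so only reachable states are ever built.
struct MismatchDFA {
    std::string P;
    int k;
    using State = std::vector<int>;
    std::map<std::pair<State, char>, State> memo;  // cached transitions

    State start() const {
        State s(P.size() + 1, k + 1);
        s[0] = 0;                                  // a match may start anywhere
        return s;
    }

    State step(const State& s, char c) {
        auto key = std::make_pair(s, c);
        auto it = memo.find(key);
        if (it != memo.end()) return it->second;
        State t(s.size());
        t[0] = 0;
        for (size_t i = 1; i < s.size(); ++i) {
            int cand = s[i - 1] + (c != P[i - 1]);
            t[i] = (cand <= k) ? cand : k + 1;     // cap keeps the state space finite
        }
        return memo[key] = t;
    }

    bool accepting(const State& s) const { return s.back() <= k; }
};

// Scan T with the lazily built DFA; return positions where a match ends.
std::vector<size_t> scan(MismatchDFA& dfa, const std::string& T) {
    auto s = dfa.start();
    std::vector<size_t> ends;
    for (size_t j = 0; j < T.size(); ++j) {
        s = dfa.step(s, T[j]);
        if (dfa.accepting(s)) ends.push_back(j);
    }
    return ends;
}
```

Capping counts at $k+1$ is what makes the number of distinct states finite; the memo map then grows only as large as the set of states actually reached on the given texts.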

This might be fast if $k$ is not too large. If $k$ gets too large, it's possible the DFA will be enormous and the approach might fail. This is probably most useful if you plan to re-use the same pattern $P$ with many different texts $T$, or if $T$ is far longer than $P$. I wouldn't recommend it for your specific situation.

I have no proof that the resulting DFA will be small. See also Levenshtein automata, which address the far harder problem of building a DFA to recognize all strings $S$ that end with something within edit distance $\le k$ of $P$ (i.e., they allow at most $k$ mismatches, insertions, and/or deletions in total). The resulting DFA is much more complex, yet there are remarkable algorithms that build the edit-distance DFA in time linear in $|P|$ (though possibly exponential in $k$). Your problem is considerably easier and will probably lead to DFAs that are significantly smaller.

D.W.