Consider the following problem. We are given a set of patterns (strings) $\Pi = \{\pi_i\}$, a text $s$, and a window length $k$. We want a list of all shifts $0 \le i \le |s|-k$ such that every pattern in $\Pi$ is contained in the substring $s[i:i+k]$.

Can this be solved in linear or near-linear time? It can of course be solved in quadratic time, $O(|s| |\Pi| + \sum |\pi_i|)$, using KMP or Aho-Corasick plus post-processing.
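For reference, here is a minimal brute-force sketch of the problem statement in Python. The function name `naive_shifts` is made up for illustration, and it makes no attempt to be efficient; it is only meant as a specification to compare faster approaches against.

```python
def naive_shifts(s, patterns, k):
    """Return every shift i such that s[i:i+k] contains all patterns.

    Direct transcription of the problem statement: each window is
    checked against each pattern with a substring test.
    """
    return [
        i
        for i in range(len(s) - k + 1)
        if all(p in s[i:i + k] for p in patterns)
    ]


print(naive_shifts("abcab", ["ab", "c"], 3))
```

With $s = \texttt{abcab}$, $\Pi = \{\texttt{ab}, \texttt{c}\}$ and $k = 3$, the windows are `abc`, `bca` and `cab`; only the first and last contain both patterns.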

The motivation for this problem is finding matches for a topic (represented by the set of patterns) in a text. In that context it actually makes sense to require the matches to be non-overlapping, so I'm also interested in that case, but it might be easier to start with the relaxed version.

I would also be interested in generalizations of the problem that allow for approximate matches of some kind, e.g., only requiring a threshold on the size of the subset of matching patterns, allowing matches within a given edit distance, or something using hidden Markov models or general probabilistic graphical models. I would be surprised if any such generalization could be solved in subquadratic time, though.

dysonsfrog

2 Answers

The set of all $k$-length strings that contain every pattern in $\Pi$ is regular; call it $L$. (It is $(\Sigma^* \pi_1 \Sigma^* \cap \dots \cap \Sigma^* \pi_m \Sigma^*) \cap \Sigma^k$.) From this, you can form a nondeterministic finite-state automaton (NFA) that recognizes $\Sigma^* L$. Next, convert this to a DFA. Finally, run the DFA across the input string $s$ and record every position at which the DFA is in an accepting state; each such position is an index $i+k$ such that $s[i:i+k]$ contains every pattern in $\Pi$.

Running the DFA over $s$ takes $O(|s|)$ time, on top of the cost of constructing the DFA, which is at least its size. I expect the size of the DFA to be exponential in $|\Pi|$ and $k$, but independent of $|s|$. So, if you only care about the dependence on $|s|$, this algorithm might meet your requirements. If you additionally care about the dependence on $|\Pi|$ and/or $k$, I do not expect this will meet your requirements.
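To make the size issue concrete, here is a rough Python sketch of one crude, non-minimal DFA for $\Sigma^* L$: its state is simply the last (at most) $k$ characters read, and states are materialized lazily as they are visited. This is an illustration under my own assumptions, not the NFA-to-DFA construction described above; the name `dfa_scan` is hypothetical, and $k \ge 1$ is assumed.

```python
def dfa_scan(s, patterns, k):
    """Scan s with a lazily built DFA whose state is the last <= k characters.

    A state of length k is accepting iff it contains every pattern.
    Accepting states are memoized, so each distinct window is examined once.
    Assumes k >= 1.
    """
    accepting = {}   # memo: state (a string of length k) -> bool
    state = ""       # start state: nothing read yet
    hits = []
    for pos, ch in enumerate(s, start=1):
        state = (state + ch)[-k:]          # transition: shift in one character
        if len(state) == k:
            if state not in accepting:
                accepting[state] = all(p in state for p in patterns)
            if accepting[state]:
                hits.append(pos - k)       # accepting at index i + k, report i
    return hits


print(dfa_scan("abcab", ["ab", "c"], 3))
```

This particular automaton can have up to $|\Sigma|^k$ states of length $k$, which gives a feel for an exponential dependence on $k$; a minimized DFA may well be smaller. Note also that the string slicing makes each step cost $O(k)$ here, so the sketch is not literally linear in $|s|$.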

D.W.

Since you want to detect all patterns at once, you can throw away any pattern if it is a substring of another one.

Once no pattern is a substring of another one, two distinct occurrences can neither begin nor end at the same position: otherwise one of the two patterns would be a prefix or a suffix of the other.

This means that when the window slides, at most one new occurrence appears and at most one disappears.

You can easily count how many times each pattern occurs inside the window while it slides: increment the pattern's count (and store where the occurrence begins) when it is detected by Aho-Corasick, and decrement it once the beginning of that occurrence falls outside the window. A window is a match exactly when every pattern's count is positive.

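Since at most one occurrence ends at each position of $s$, there are at most $|s|$ occurrences to process, so this should solve your problem in $\mathcal{O}(|s|+\sum{|\pi_i|})$ time.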
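A minimal Python sketch of the sliding-window idea above, under some assumptions of mine: the names `build_automaton` and `window_shifts` are made up, empty patterns are dropped, and the substring filter is done naively in quadratic time for brevity (a linear-time filter would be needed to actually reach the stated bound).

```python
from collections import deque


def build_automaton(patterns):
    """Aho-Corasick automaton; out[v] is the index of the pattern ending at v, or -1."""
    goto, fail, out = [{}], [0], [-1]
    for idx, p in enumerate(patterns):
        v = 0
        for ch in p:
            if ch not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append(-1)
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v] = idx
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        v = queue.popleft()
        for ch, u in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[u] = goto[f].get(ch, 0)
            if out[u] == -1:               # inherit the (unique) suffix match
                out[u] = out[fail[u]]
            queue.append(u)
    return goto, fail, out


def window_shifts(s, patterns, k):
    """Shifts i such that s[i:i+k] contains every pattern."""
    pats = sorted({p for p in patterns if p})
    # Naive quadratic filter: drop patterns contained in another pattern.
    pats = [p for p in pats if not any(p != q and p in q for q in pats)]
    goto, fail, out = build_automaton(pats)
    count = [0] * len(pats)                # occurrences of each pattern in window
    covered = 0                            # number of patterns with count > 0
    live = deque()                         # (start, pattern index), starts increasing
    shifts, v = [], 0
    for j, ch in enumerate(s):
        while v and ch not in goto[v]:
            v = fail[v]
        v = goto[v].get(ch, 0)
        if out[v] != -1:                   # an occurrence ends at position j
            idx = out[v]
            live.append((j + 1 - len(pats[idx]), idx))
            count[idx] += 1
            covered += count[idx] == 1
        i = j + 1 - k                      # window is s[i:i+k]
        while live and live[0][0] < i:     # occurrence begins before the window
            _, idx = live.popleft()
            count[idx] -= 1
            covered -= count[idx] == 0
        if i >= 0 and covered == len(pats):
            shifts.append(i)
    return shifts


print(window_shifts("abcab", ["ab", "c"], 3))
```

The FIFO queue suffices because, with no pattern contained in another, occurrences that end later also begin later, so the oldest occurrence is always the first to leave the window. An occurrence longer than $k$ is added and removed within the same step, so it never contributes to a match.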

Dmitri Urbanowicz