Following yesterday's StackOverflow outage - is regular expression matching really difficult, or is the implementation simply inefficient?

Question

Yesterday StackOverflow was down for half an hour. Later, they wrote a blog post about it, detailing that the problem stemmed from unexpectedly high complexity of regular expression matching.

In short, matching the regular expression a+b against the string aaaaaaaaaaaaaac takes $O(n^2)$ time, where $n$ is the number of a characters, because the engine uses backtracking.

You can reproduce the issue with the following Python code, which takes over 4 seconds to run on my computer:

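import re, time

# 20000 spaces followed by a non-space: \s+$ can never match here,
# but the backtracking engine retries from every starting position.
start = time.time()
re.findall(r'\s+$', ' '*20000 + 'x')
print(time.time() - start)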

This was very surprising to me; I'd have thought that a regex matcher works more efficiently, e.g. by constructing a DFA from the regex and then running the input string through it, which I'd have thought would take $O(n)$ time (not including the DFA construction).
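To make the idea concrete, here is a minimal sketch of what I have in mind for the a+b example. The transition table is built by hand rather than derived from the regex, and the names `TRANSITIONS` and `contains_a_plus_b` are made up for illustration. It only answers whether the text contains a match, and it reads each character at most once.

```python
# Hand-built DFA for "the text contains a match of a+b" (illustrative only).
# State 0: no useful progress so far.
# State 1: the most recent character was an 'a'.
# State 2: a match has been seen (accepting).
TRANSITIONS = {
    0: {"a": 1},
    1: {"a": 1, "b": 2},
}

def contains_a_plus_b(text):
    state = 0
    for ch in text:
        if state == 2:
            break  # already matched, nothing left to decide
        # Any character without an explicit transition resets to state 0.
        state = TRANSITIONS[state].get(ch, 0)
    return state == 2

print(contains_a_plus_b("a" * 20000 + "c"))  # one pass over the input
```

On the input that is slow for the backtracking engine, this performs one table lookup per character.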

(For instance, the book Introduction to Algorithms by Cormen, Leiserson, Rivest goes through a similar algorithm on the way to introducing the Knuth-Morris-Pratt algorithm).

My question: Is there something inherently difficult in regular expression matching that does not allow an $O(n)$ algorithm, or are we simply talking about an inefficient implementation (in Python, in whatever StackOverflow uses, etc.)?

score 9 · Answer 1 · edited Apr 13 '17 at 12:48

If you only wanted to match plain regular expressions then you wouldn't have such problems (unless you were a really incompetent programmer, I guess). There are caveats, though: building the automaton takes time, and an asymptotically worse algorithm may outperform the automaton approach in many practical cases.

The real issue is probably that they use library functions for regexps, which are far more powerful than plain regular expressions. Features like matching groups introduce further complexity.

In this case, trouble arose because these engines match substrings (with plain regular expressions, we typically only match whole inputs) and use backtracking; long partial matches that eventually mismatch cause long backtracks. In essence, this is the worst case of naive, quadratic-time string matching.
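The quadratic behaviour can be made visible by counting character checks. The following is only a sketch of the naive strategy for the a+b example, not how any real engine is implemented; `naive_search_steps` is a made-up helper that tries every start position in turn and restarts from scratch after each failure.

```python
def naive_search_steps(text):
    """Search for a+b from every start position; return (match_start, checks)."""
    checks = 0
    for start in range(len(text)):
        i = start
        # Greedily consume the run of 'a' characters.
        while i < len(text) and text[i] == "a":
            checks += 1
            i += 1
        # One more check for whatever ends the run (a 'b', another
        # character, or the end of the input).
        checks += 1
        if i > start and i < len(text) and text[i] == "b":
            return start, checks
        # Mismatch: throw away all progress and retry one position later.
    return None, checks

for n in (100, 200, 400):
    print(n, naive_search_steps("a" * n + "c")[1])
```

For an input of $n$ a characters followed by c, the count works out to $n(n+1)/2 + n + 1$, so doubling $n$ roughly quadruples the work.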

Can this be improved? Maybe. Using the ideas from string matching automata we would backtrack not to the second symbol, but to the start of the longest suffix that matches a prefix of the pattern. But since the pattern is not fixed anymore, it's certainly not trivial to extend the ideas.
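For a fixed pattern, this is the Knuth-Morris-Pratt idea. Here is a minimal sketch, assuming a plain string pattern rather than a regex; `prefix_function` and `kmp_find` are illustrative names. After a mismatch, the search falls back to the longest proper prefix of the pattern that is also a suffix of what has been matched so far, so the text is never re-read.

```python
def prefix_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = prefix_function(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without moving backwards in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(prefix_function("aaab"))
print(kmp_find("a" * 20000 + "c", "aaab"))
```

The difficulty mentioned above is that a regex such as a+b has no single fixed string from which to precompute such a table.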
