I am taking the Coursera class on compilers, and in the lesson about lexers it is hinted that there is a time-space tradeoff between using a non-deterministic finite automaton (NFA) and a deterministic finite automaton (DFA) to match regular expressions. If I understand correctly, the tradeoff is that an NFA is smaller but more time-consuming to traverse, because all possible states have to be considered at the same time, and that is why it is usually transformed into a DFA. Are there any lexers that use NFAs instead of DFAs in "real" life, i.e. some compiler that is used in production and not just a proof of concept?
4 Answers
Compiled lexical analyzers compile the NFA to a DFA.
Good interpreted regular expression matchers, on the other hand, use Thompson's algorithm, simulating the NFA with memoization. This is equivalent to compiling the NFA to a DFA, except that you only produce DFA states on demand, as they are needed. At each step your deterministic state is a set of NFA states; given the next input character, you transition to a new set of NFA states. You cache previously seen states and their outgoing transitions in a hash table. The hash table is flushed if it fills up, so it does not grow without bound.
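To make that concrete, here is a rough sketch of the idea (my own illustration in Python, not RE2's or any production engine's code): the current deterministic state is a frozenset of NFA states, and the transitions out of it are computed lazily and memoized in a dict. The `nfa` transition table and `make_matcher` are hypothetical names used only for this example.

```python
def make_matcher(nfa, start_states, accepting):
    """nfa maps (state, character) -> set of successor states."""
    cache = {}                                    # (frozenset, char) -> frozenset

    def step(states, ch):
        key = (states, ch)
        if key not in cache:                      # build this DFA transition on demand
            cache[key] = frozenset(t for s in states
                                     for t in nfa.get((s, ch), ()))
            # A real engine would bound the cache and flush it when full, e.g.:
            # if len(cache) > CACHE_LIMIT: cache.clear()
        return cache[key]

    def match(text):
        states = frozenset(start_states)
        for ch in text:
            states = step(states, ch)
            if not states:                        # no NFA state is alive: reject early
                return False
        return bool(states & accepting)

    return match

# Example: an NFA for (a|b)*abb
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}, (2, 'b'): {3}}
match = make_matcher(nfa, {0}, {3})
print(match('aababb'))   # True
print(match('aabab'))    # False
```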
The reason you do it this way is that converting the NFA to DFA can take time exponential in the size of the regular expression. This is certainly not something you want to do if you are only evaluating the regular expression once.
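To illustrate the blowup with a concrete textbook example of my own (not taken from the answer): for the language "the n-th symbol from the end is an a", i.e. the regex (a|b)*a(a|b){n-1}, an NFA with n+1 states suffices, but the subset construction yields 2^n reachable DFA states, because the DFA must remember which of the last n symbols were a's. The little script below is a hypothetical illustration that just counts those states.

```python
def dfa_state_count(n):
    """Count DFA states the subset construction produces for (a|b)*a(a|b){n-1}."""
    # NFA states: 0 loops on everything; state i (1..n) means "an 'a' was read
    # i symbols ago"; state n is accepting.
    def step(states, ch):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)                # (a|b)* self-loop
                if ch == 'a':
                    nxt.add(1)            # guess this 'a' is the n-th from the end
            elif s < n:
                nxt.add(s + 1)            # consume one of the trailing symbols
        return frozenset(nxt)

    start = frozenset({0})
    seen, frontier = {start}, [start]
    while frontier:                       # explore all reachable subset-states
        state = frontier.pop()
        for ch in 'ab':
            t = step(state, ch)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return len(seen)

for n in range(1, 6):
    print(n, dfa_state_count(n))          # 2, 4, 8, 16, 32  -- i.e. 2**n
```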
RE2 is an example of a regex engine that (essentially) uses Thompson's algorithm. I can highly recommend the brilliant blog posts by RE2's author Russ Cox if you want to learn more (including lots of historical information and experimental comparisons of many different approaches to regex searching).
I can also highly recommend the "why GNU grep is fast" email chain. Lesson 1 is: the common case for regex search is simple string search, so special case your algorithm.
I see only two applications of using an NFA (or rather its power automaton without writing it down) instead of a minimized DFA:
- Homoiconic languages, where you may want to modify your lexer frequently
- Strange syntax that may blow up your DFA, for example:

  identifier := [a-z][a-z0-9_]*
  indices    := [0-9_]{1,256}    // up to 256 times
  var        := identifier "_" indices | identifier

  If you give the last rule precedence, your lexer has to check whether an identifier contains "_" within the last 256 symbols and shorten it in that case.
I'd be surprised if they did. The construction of the lexer is done once (hopefully), while the result is used millions of times (just think how many tokens there are in your medium-sized source file). So, unless there are very unusual circumstances, it pays off to make the lexer as fast (and otherwise as resource-frugal) as possible, i.e., go for a minimal DFA.
In the strict formal sense, no. Non-determinism in the theoretical/mathematical sense allows a machine to choose a computation path based on whether it eventually leads to an accepting state or not, without looking any further ahead in the input. So in this strict sense it is a property that is only suitable for theoretical examination, and there is no such thing as a real non-deterministic machine; in particular, in this case you can't actually build an NFA unless you can see into the future, in which case building a compiler with that talent would be a bit of a waste! ;)
However, "nondeterministic" and "nondeterminism" are often used in a weaker, hazily defined sense. Sometimes it can mean randomised/probabilistic: the algorithm flips a coin. In a formal setting this is studied as probabilistic/randomised algorithms and is not referred to as nondeterminism. Another use is for an algorithm that doesn't necessarily produce the same output given two runs on the same input; it may not be random, but some part of its behaviour is unspecified, so there may be several valid outputs (personally, I think this definition comes from confusing un-determined and non-deterministic).
Nonetheless, you could, in principle, build a lexer that is nondeterministic in one of these weaker, informal senses. However, it wouldn't be an NFA (that's a strict formal machine model), and I can't imagine it would be a crash-hot idea either: a lexer needs to be quite predictable.
The last option is that you can simulate non-determinism via backtracking or parallelism, but in this case you lose out on the apparent efficiency of non-determinism, as you're effectively turning it into a deterministic computation, so you're no better off than with a DFA.
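As a purely illustrative aside of my own (not part of the original answer), the backtracking simulation can be sketched like this: try one of the NFA's choices, recurse, and back up and try the next choice on failure. The `accepts` function and the example transition table are hypothetical names made up for this sketch.

```python
def accepts(nfa, start, accepting, s, i=0, state=None, visited=None):
    """Backtracking simulation of an NFA given as (state, symbol) -> set of states.
    The empty string '' is used as the symbol for epsilon transitions."""
    state = start if state is None else state
    visited = set() if visited is None else visited
    if i == len(s) and state in accepting:
        return True
    # Epsilon moves (visited guards against epsilon cycles at the same position).
    for nxt in nfa.get((state, ''), ()):
        if (i, nxt) not in visited:
            if accepts(nfa, start, accepting, s, i, nxt, visited | {(i, nxt)}):
                return True
    # Consuming moves: try each possible successor; backtrack if the recursion fails.
    if i < len(s):
        for nxt in nfa.get((state, s[i]), ()):
            if accepts(nfa, start, accepting, s, i + 1, nxt, set()):
                return True
    return False

# Example: an NFA for (a|b)*abb
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}, (2, 'b'): {3}}
print(accepts(nfa, 0, {3}, 'ababb'))   # True
print(accepts(nfa, 0, {3}, 'abab'))    # False
```

In the worst case this explores exponentially many paths, which is exactly the efficiency you give up compared to running a DFA (or a cached subset simulation) over the same input.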