I studied the subject of building a lexical analyzer using finite automata, via the classical route: Regular Expression -> NFA -> DFA. I found it elegant, and a more solid approach to building a lexer.
Now my question is: what are the other benefits of building a lexer this way, rather than the "ad hoc" way?
1 Answer
Speed and simplicity. A regular expression is a compact, declarative specification of lexemes. For most realistic token patterns, hand-writing code to mimic a specific regular expression would likely be an order of magnitude longer and less clear, especially if you want high performance.
A regular expression, on the other hand, can be compiled to very efficient code and is relatively simple to understand. You also get guarantees that are difficult or impossible to verify for hand-written code. For example, a regular expression is guaranteed to be implementable using only a fixed, finite amount of memory, and matching is, of course, guaranteed to terminate. This is very similar to the reason we primarily use SQL, rather than arbitrary code, to query relational databases.
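As a small illustration of how compact the declarative specification is, here is a minimal sketch that recognizes an identifier lexeme using the POSIX `<regex.h>` API. The pattern and the input are my own illustrative choices, not anything the answer prescribes:

```c
/* Sketch: a one-line, declarative lexeme spec via POSIX <regex.h>.
 * The pattern and input are illustrative assumptions. */
#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    regmatch_t m;
    /* One compact line replaces a hand-written scanning loop. */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*", REG_EXTENDED) != 0)
        return 1;
    const char *input = "foo42 = bar";
    if (regexec(&re, input, 1, &m, 0) == 0)
        printf("lexeme: %.*s\n", (int)(m.rm_eo - m.rm_so), input + m.rm_so);
    regfree(&re);
    return 0;
}
```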
So far this has been an answer to the question of why we model/implement lexical analysis with regular expressions, not why we implement regular expressions with DFAs. Again, the main reasons are speed and simplicity. There are, however, many other options, including: directly interpreting the regular expression operators, compiling regular expressions to machine code, taking derivatives of regular expressions, simulating NFAs directly, and converting a regular expression to a CFG and applying CFG techniques. Many of these approaches are arguably "inherently" slower than DFAs; e.g. using CFG techniques incurs overhead to support a degree of flexibility that lexing never uses.
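To make one of these alternatives concrete, here is a minimal sketch of simulating an NFA directly. The regex (a|b)*abb, its hard-coded NFA, and the bitmask representation of the active-state set are my assumptions for the example, not part of the answer:

```c
/* Sketch: direct NFA simulation for the regex (a|b)*abb.
 * The set of currently active states is a bitmask; input is
 * assumed to range over {a, b}. Illustrative only -- a real
 * lexer generator would build a table like this automatically. */
#include <stdint.h>
#include <stdio.h>

/* delta[state][symbol] = bitmask of successor states; symbol 0 = 'a', 1 = 'b' */
static const uint8_t delta[4][2] = {
    {0x03, 0x01},  /* state 0: (a|b)* self-loop; 'a' may also start "abb" */
    {0x00, 0x04},  /* state 1: on 'b' -> {2} */
    {0x00, 0x08},  /* state 2: on 'b' -> {3} */
    {0x00, 0x00},  /* state 3: accepting, no outgoing edges */
};

static int nfa_match(const char *s) {
    uint8_t states = 0x01;                    /* start in {0} */
    for (; *s; s++) {
        int sym = (*s == 'a') ? 0 : 1;
        uint8_t next = 0;
        for (int q = 0; q < 4; q++)           /* union over active states */
            if (states & (1u << q))
                next |= delta[q][sym];
        states = next;
    }
    return (states & 0x08) != 0;              /* accept iff state 3 active */
}

int main(void) {
    printf("%d %d\n", nfa_match("ababb"), nfa_match("abab")); /* prints: 1 0 */
    return 0;
}
```

Note the cost visible in the inner loop: each input character requires a union over all active states, which is exactly the per-character overhead that the subset construction to a DFA eliminates.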
There are two obvious ways of implementing a DFA, at least when the alphabet is reasonably small (both sketched below). One is a two-dimensional table indexed by state and character. This approach is modular: you write the matcher once and change the language just by providing a different table (perhaps dynamically). Matching involves incrementing a pointer, one 2D table lookup, and a highly predictable jump. The second approach is to "store the state in the instruction pointer/program counter" by making each state its own block of code. Matching then involves incrementing a pointer and an (unpredictable) indirect jump. Ignoring instruction/data caching and branch prediction effects, it's hard to conceive of a more efficient approach than these: you are looking at two or three machine instructions per character of input. In the latter case, instruction caching and branch prediction behavior can be improved by combining blocks of code together, and there is no real way to beat that approach short of such micro-architectural tuning. Again, all of this can be generated by a tool that takes a regular expression as input.
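Here is a hedged sketch of both styles, again for a DFA recognizing (a|b)*abb; the regex, state numbering, and two-symbol alphabet are my illustrative assumptions, not part of the answer:

```c
/* Sketch: the two DFA implementation styles described above, for the
 * DFA of (a|b)*abb. Input is assumed to range over {a, b}; a generator
 * would emit tables/blocks like these. State 3 is accepting. */
#include <stdio.h>

/* Approach 1: two-dimensional table indexed by state and symbol
 * (symbol 0 = 'a', 1 = 'b'). */
static const int table[4][2] = {
    {1, 0},  /* state 0: no useful progress          */
    {1, 2},  /* state 1: seen "a"                    */
    {1, 3},  /* state 2: seen "ab"                   */
    {1, 0},  /* state 3: seen "abb" (accepting)      */
};

static int dfa_table(const char *s) {
    int state = 0;
    for (; *s; s++)                  /* pointer increment + one table lookup */
        state = table[state][*s == 'b'];
    return state == 3;
}

/* Approach 2: the state lives in the program counter; each state is
 * its own block of code and every transition is a jump. */
static int dfa_blocks(const char *s) {
    char c;
s0: if (!(c = *s++)) return 0;
    if (c == 'a') goto s1; else goto s0;
s1: if (!(c = *s++)) return 0;
    if (c == 'a') goto s1; else goto s2;
s2: if (!(c = *s++)) return 0;
    if (c == 'a') goto s1; else goto s3;
s3: if (!(c = *s++)) return 1;      /* end of input in the accepting state */
    if (c == 'a') goto s1; else goto s0;
}

int main(void) {
    printf("%d %d\n", dfa_table("ababb"), dfa_blocks("ababb")); /* 1 1 */
    printf("%d %d\n", dfa_table("abab"),  dfa_blocks("abab"));  /* 0 0 */
    return 0;
}
```

The table version swaps languages by swapping `table`; the block version trades that modularity for transitions that compile down to direct jumps.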