
I'm planning to implement a lexical analyzer by either simulating an NFA or running a DFA over the input text. The trouble is that the input may arrive in small chunks, and there may not be enough memory to hold one very long token.

Let's assume I have three tokens, "ab", "abcd" and "abce". The NFA I obtained is this: (NFA diagram omitted)

And the DFA I obtained is this: (DFA diagram omitted)

Now if the input is "abcf", the correct action is to recognize the token "ab" according to the maximal munch rule and then produce a lexer error token. However, both the DFA and the NFA still have state transitions available after "ab" has been read, so the maximal munch rule encourages the lexer to keep reading and consume the "c" as well.

How do maximal munch lexers solve this issue? Do they store the entire token in memory and backtrack from "abc" to "ab"?

One possibility would be to run the DFA with a "generation index", with potentially multiple generations and multiple branches within a generation active at a time. For the input "abcf", the DFA would go through these steps:

- {0(gen=0,read=0..0)}
- read "a" → {1(gen=0,read=0..1)}
- read "b" → {2+(gen=0,read=0..2,frozen), 2+(gen=0,read=0..2), 0(gen=1,read=2..2)}
- read "c" → {2+(gen=0,read=0..2,frozen), 3(gen=0,read=0..3)}
- read "f" → {2+(gen=0,read=0..2,frozen)}

Then the lexer would report the match in state 2+, and since there is no way to continue, it would then report an error state. I'm not sure how well this idea would work...

For "abcd", it would work like this:

- {0(gen=0,read=0..0)}
- read "a" → {1(gen=0,read=0..1)}
- read "b" → {2+(gen=0,read=0..2,frozen), 2+(gen=0,read=0..2), 0(gen=1,read=2..2)}
- read "c" → {2+(gen=0,read=0..2,frozen), 3(gen=0,read=0..3)}
- read "d" → {2+(gen=0,read=0..2,frozen), 4+(gen=0,read=0..4,frozen), 4+(gen=0,read=0..4), 0(gen=1,read=4..4)}

Now of these, it's possible to drop the first (there is a longer match) and the third (it has no outgoing state transitions), leaving:

{4+(gen=0,read=0..4,frozen), 0(gen=1,read=4..4)}.

Then the lexer would indicate "match: 4+" and continue reading input from state 0 using generation index 1.
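
To make the bookkeeping concrete, here is a rough Python sketch of what I have in mind. Everything in it is made up for this example (the `Gen` record, the token codes, the table layout), and what happens after a lexical error (resynchronisation) is left out entirely:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# DFA for the three tokens "ab", "abcd" and "abce" (state numbers as in my diagrams).
START = 0
TRANS = {(0, 'a'): 1, (1, 'b'): 2, (2, 'c'): 3, (3, 'd'): 4, (3, 'e'): 4}
ACCEPT = {2: 'AB', 4: 'ABCD_OR_ABCE'}      # accepting state -> token code (made-up codes)

@dataclass
class Gen:
    start: int                              # input position where this generation's token begins
    state: Optional[int] = START            # live DFA state, or None once this branch is dead
    best: Optional[Tuple[str, int]] = None  # frozen (token code, end) = longest match so far

def lex(chars: Iterable[str]):
    """Yield ('TOKEN', code, start, end) or ('ERROR', start), reading each character only once."""
    gens: List[Gen] = [Gen(start=0)]
    pos = 0

    def resolve_front():
        # Emit results for leading generations that can no longer change.
        out = []
        while gens and gens[0].state is None:
            g = gens.pop(0)
            if g.best is None:
                out.append(('ERROR', g.start))   # resynchronisation is out of scope here
                break
            code, end = g.best
            out.append(('TOKEN', code, g.start, end))
        return out

    for ch in chars:
        # Step every live generation through the DFA on this character.
        for g in gens:
            if g.state is not None:
                g.state = TRANS.get((g.state, ch))
        pos += 1
        # Freeze the longest match seen so far and (re)spawn the speculative next generation.
        for i, g in enumerate(gens):
            if g.state in ACCEPT:
                g.best = (ACCEPT[g.state], pos)  # a longer match supersedes a shorter one
                del gens[i + 1:]                 # later generations assumed the shorter match
                gens.append(Gen(start=pos))      # the next token *might* start here
                break
        yield from resolve_front()

    # End of input: flush whatever is still pending.
    for g in gens:
        if g.best is not None:
            code, end = g.best
            yield ('TOKEN', code, g.start, end)
        elif g.start < pos:
            yield ('ERROR', g.start)
            break

print(list(lex("abcf")))   # [('TOKEN', 'AB', 0, 2), ('ERROR', 2)]
print(list(lex("abcd")))   # [('TOKEN', 'ABCD_OR_ABCE', 0, 4)]
```

For "abcf" this yields the token "ab" followed by an error at position 2, and for "abcd" a single four-character token, matching the traces above.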

Is this idea of mine, running DFAs nondeterministically, how maximal munch lexical analyzers work?

juhist

1 Answer


There are two ways to handle this issue:

  1. The most common implementation (the one used in lex, flex and other similar scanner generators) is to always remember the last accept position and state (or accept code). When no more transitions are possible, the input is backed up to the last accept position and the token corresponding to the last accept state is reported. (A rough sketch of this loop follows the list.)

    If you're trying to do streaming input, you will need a fallback buffer to handle this case.

  2. Alternatively, if the scan reaches an accepting state but another transition is available, we can start performing two scans in parallel: one on the assumption that the transition will be taken, and the other on the assumption that it will not. The second thread may need to fork again, although there is a maximum number of forks, as with generalised LR parsing. In this model, we need to keep a buffer of possible "future" tokens which will be processed if the optimistic thread fails.
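
Here is a rough Python sketch of the first strategy, using the toy DFA from the question. The table layout, the names (`next_token`, `tokens`) and the error handling are invented for illustration; a real generated scanner does the same thing with packed tables and a character buffer:

```python
# Toy DFA for the tokens "ab", "abcd" and "abce", as in the question.
START = 0
TRANS = {(0, 'a'): 1, (1, 'b'): 2, (2, 'c'): 3, (3, 'd'): 4, (3, 'e'): 4}
ACCEPT = {2: 'AB', 4: 'ABCD_OR_ABCE'}      # accepting state -> token code (made-up codes)

def next_token(buf, pos):
    """Scan one token starting at buf[pos].

    Returns (token_code_or_None, end): `end` is the position just past the accepted
    lexeme, and the caller resumes scanning there, so any characters read beyond the
    last accept are effectively given back (this is the back-up).
    """
    state, last_accept, i = START, None, pos
    while i < len(buf) and (state, buf[i]) in TRANS:
        state = TRANS[(state, buf[i])]
        i += 1
        if state in ACCEPT:
            last_accept = (ACCEPT[state], i)   # remember the most recent accept
    if last_accept is None:
        return None, pos + 1                   # no rule matched: emit one error character
    return last_accept                         # back up to the last accept position

def tokens(buf):
    pos = 0
    while pos < len(buf):
        code, end = next_token(buf, pos)
        yield (code or 'ERROR', buf[pos:end])
        pos = end

print(list(tokens("abcf")))   # [('AB', 'ab'), ('ERROR', 'c'), ('ERROR', 'f')]
print(list(tokens("abcd")))   # [('ABCD_OR_ABCE', 'abcd')]
```

The characters read between the last accept position and the point where the scan gets stuck (here, just the "c") are exactly what the fallback buffer has to hold when the input arrives in chunks.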

I don't know of a practical implementation of the second strategy in a general-purpose scanner generator, although there are some papers about how you might do it. Apparently it can be done in time and space linear in the size of the input, which is (in theory) better than the quadratic worst-case time of backtracking.

However, it is pretty rare that you find a token grammar which needs to allow unrestricted backtracking. The most common cause of unrestricted backtracking is failing to take into account the fact that things like quoted strings might not be correctly terminated in an incorrect program, so you end up with just the rule:

["]([^"]|\\.)*["]   { Accept a string }

instead of the pair of rules:

["]([^"]|\\.)*["]   { Accept a string. }
["]([^"]|\\.)*      { Reject an unterminated string. }

(Maximal munch will guarantee that the second rule will only be used if the first rule cannot match.)
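
To see the difference concretely, here is a small sketch of the same last-accept scan over a hand-coded DFA for the string rules. To keep the DFA tiny I exclude the backslash from the character class, i.e. I use ["]([^"\\]|\\.)*["]; the state numbers and all names are invented:

```python
ACCEPT_ONE  = {3: 'STRING'}                      # only the terminated-string rule
ACCEPT_BOTH = {3: 'STRING', 1: 'UNTERMINATED'}   # plus the fallback rule

def step(state, ch):
    # Hand-coded DFA for ["]([^"\\]|\\.)*["] and its unterminated prefix.
    if state == 0:
        return 1 if ch == '"' else None                      # state 1: inside the string
    if state == 1:
        return 3 if ch == '"' else 2 if ch == '\\' else 1    # state 3: closed string
    if state == 2:
        return 1                                             # state 2: just after a backslash
    return None                                              # state 3: no further transitions here

def scan(buf, accept):
    """Return (token_code, characters_to_give_back) for a scan starting at buf[0]."""
    state, last_accept, i = 0, None, 0
    while i < len(buf):
        nxt = step(state, buf[i])
        if nxt is None:
            break
        state, i = nxt, i + 1
        if state in accept:
            last_accept = (accept[state], i)
    if last_accept is None:
        return 'ERROR', max(i - 1, 0)      # consume one character; everything else is rescanned
    code, end = last_accept
    return code, i - end                   # characters read past the last accept are given back

text = '"abc'                              # an unterminated string, then end of input
print(scan(text, ACCEPT_ONE))              # ('ERROR', 3)        -- back up over the whole string body
print(scan(text, ACCEPT_BOTH))             # ('UNTERMINATED', 0) -- no back-up at all
```

With only the first rule the scanner reads the whole unterminated string without ever recording an accept, so all of that input has to be given back and rescanned; with the fallback rule there is always a recent accept, so the back-up stays bounded (here, zero).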

So while the second strategy may have some theoretical appeal, it seems to me that it's of little practical use. Flex even has options which help you identify rules that could back up on failure, and that can help you craft your lexical grammar to avoid the problem. It's not always easy to eliminate 100% of backing up (although it often is, and if you manage to do so, flex will reward you by generating a faster lexer). But it's pretty rare to find a lexical grammar which requires more than a few characters of back-up, and the cost of a small fallback buffer is really not worth worrying about in comparison with the complexity of the alternative (which, of course, also needs extra memory).

I have seen intermediate strategies for particular grammars. If you know your grammar well enough, you can hand-build the speculative tokenisation in order to avoid backing up. I saw that, years ago, in SGML lexers which eliminated the rescan of the > following a tag name by including a redundant rule that recognised a tag immediately followed by a > and handled both tokens at once. That must have saved a few cycles, but it's hard to believe it really made a huge difference, and the difference would likely be even less significant today. Still, if you are the type who obsesses about saving every possible cycle, you could do it.

rici