As stated in the question, this organization recognizes only context-free languages.
Lexical analysis is actually a GSM mapping, i.e. a finite-state
transduction. Since the context-free languages form a full Abstract Family of Languages (full AFL), they are closed under
inverse GSM mapping. Hence, if the sequences of lexemes parsed by the second phase belong to a context-free language, the original texts, as sequences of characters, also belong to a
context-free language. This two-level organization does not change the
context-free character of the syntax, assuming the second level is indeed context-free.
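To make that first phase concrete, here is a minimal sketch of my own (a toy example, not from the question) of lexical analysis as a finite-state transduction from characters to tokens; the regular expressions stand in for the underlying finite automaton:

```python
import re

# Toy token vocabulary: finite, even though the set of lexemes is not.
TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("PLUS", r"\+"),
    ("LPAR", r"\("),
    ("RPAR", r"\)"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    """Transduce a character string into a (token, lexeme) sequence."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(lex("foo + (42)")))
# [('ID', 'foo'), ('PLUS', '+'), ('LPAR', '('), ('NUM', '42'), ('RPAR', ')')]
```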
One point is worth noting, though it is a minor technical issue. For
the second phase to have this context-free character, identifiers must be replaced
by a finite set of standard categories, each represented by a unique
symbol. The same is true for literals such as strings or numbers. The
point is that many languages place, in principle, no limit on the
size of identifiers, or of some literals, so that there are in principle infinitely many
of them. But the context-free grammar of the second phase must have a
finite alphabet, like any CF grammar. Hence the lexical elements
returned by the first phase cannot actually distinguish all
identifiers, or all literals, for the second phase.
However, the information is kept by other means for the more semantic
phases of the compiling process.
It is true that any given program does have a finite number of
identifiers and literals, but the point is that we are considering the language of all possible programs.
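A hedged sketch of how that information can be kept: the parser sees only the single symbol ID, while the actual spelling goes into a symbol table for the semantic phases (the token shape and names below are my own illustrative choices):

```python
# The parser's alphabet stays finite because every identifier collapses
# to the one symbol ID; the spelling is kept "by other means"
# (a symbol table) for the semantic phases.
symbol_table = []

def make_token(kind, lexeme):
    if kind == "ID":
        symbol_table.append(lexeme)           # information kept aside
        return ("ID", len(symbol_table) - 1)  # the parser sees only ID
    return (kind, lexeme)

print(make_token("ID", "total_count"))  # ('ID', 0)
print(make_token("ID", "i"))            # ('ID', 1)
print(symbol_table)                     # ['total_count', 'i']
```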
Another point is of course that the texts of compiler-accepted programs do
not form a context-free language, as other constraints (such as
declarations of identifiers) are imposed on programs, in parallel to
the strictly syntactic parsing process.
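For instance, a declared-before-use constraint is typically enforced by a separate pass rather than by the grammar; a rough sketch, with token shapes of my own invention:

```python
# The "declared before use" constraint is not context-free, so it is
# checked outside the grammar, over the identifiers the parse produced.
def check_declarations(tokens):
    declared = set()
    for kind, name in tokens:
        if kind == "DECL":
            declared.add(name)
        elif kind == "USE" and name not in declared:
            raise NameError(f"{name!r} used before declaration")

check_declarations([("DECL", "x"), ("USE", "x")])  # passes silently
# check_declarations([("USE", "y")])               # would raise NameError
```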
But, ignoring those more "semantic" constraints, some parsers also
use a variety of parsing "tricks", such as prioritization of reduction
rules to avoid ambiguities or non-determinism. As long as this is done
with a finite memory and a pushdown stack that is used only within a
fixed finite distance from its top, it does not take the syntax out
of the context-free realm. However, whatever extra parsing rules the
parser uses also have to be known to the users of the language,
which makes the syntax more complex for users to understand and
gives more ground for misunderstanding. TANSTAAFL. When such rules are
used to reduce ambiguity, it can then happen that the user reads
his program one way, while the parser reads it another way.
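One common instance of such a trick is an operator-precedence table that commits the parser to one of several possible reductions. A minimal sketch (a precedence-climbing parser of my own; generator mechanisms such as declared precedences differ in detail but serve the same purpose):

```python
# A fixed precedence table resolves the ambiguous grammar
# E -> E OP E | NUM deterministically: finite extra memory, and the
# stack is still used as an ordinary pushdown.
PREC = {"+": 1, "*": 2}

def parse_expr(tokens, pos=0, min_prec=1):
    node, pos = tokens[pos], pos + 1              # a NUM leaf
    while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
        op, pos = tokens[pos], pos + 1
        rhs, pos = parse_expr(tokens, pos, PREC[op] + 1)
        node = (op, node, rhs)
    return node, pos

tree, _ = parse_expr(["1", "+", "2", "*", "3"])
print(tree)  # ('+', '1', ('*', '2', '3')) -- '*' binds tighter by fiat
```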
My own preference is for general CF parsers, which will detect
ambiguity and reject it as a programming error (while grammar
ambiguity is undecidable in general, whether a given string has more
than one parse is, of course, decidable). But that is very much a
matter of taste in design.
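As a rough illustration of per-string ambiguity detection (a toy exhaustive parser, not a production technique), one can count the parses of a string under an ambiguous grammar such as E -> E '+' E | 'n' and reject when the count exceeds one:

```python
from functools import lru_cache

def count_parses(tokens):
    """Count the parse trees of tokens under E -> E '+' E | 'n'."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):                       # parses of toks[i:j] as an E
        n = 1 if j - i == 1 and toks[i] == "n" else 0
        for k in range(i + 1, j - 1):      # try each '+' as the root
            if toks[k] == "+":
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(toks))

print(count_parses(["n", "+", "n"]))            # 1 parse: accept
print(count_parses(["n", "+", "n", "+", "n"]))  # 2 parses: reject
```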
Including the lexical phase in the context-free syntax may possibly
make those extra rules more complex, as reasoning can no longer
isolate lexical issues from CF syntactic ones. It can also make the
automaton underlying the parser more complex, or call for more extra
rules. Typically, if parsing the CF part of the syntax requires a
lookahead that may include an identifier, the lookahead becomes
unbounded when identifiers are not reduced to a single symbol by a lexical
phase. Much also depends on the chosen parser-generation technology.
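To illustrate the lookahead point with a toy example of my own: deciding between two constructs may take one token of lookahead after an identifier, but arbitrarily many characters when the identifier is not pre-tokenized:

```python
# With a lexical phase: one token of lookahead decides the construct.
def classify_tokens(tokens):
    return "assignment" if tokens[1] == "=" else "call"

# Without one: the decision must scan past the identifier itself, so
# the character-level lookahead grows with the identifier's length.
def classify_chars(text):
    i = 0
    while i < len(text) and (text[i].isalnum() or text[i] == "_"):
        i += 1
    return "assignment" if i < len(text) and text[i] == "=" else "call"

print(classify_tokens(["ID", "=", "NUM"]))        # assignment
print(classify_chars("very_long_identifier(1)"))  # call
```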
Though some scannerless parsers are used, I would think this applies more to languages with specific lexical and syntactic characteristics. But I am no expert on this, and the Wikipedia page seems a bit weak on the general presentation side.
Finally, another interesting point, close to the initial question, is
error recovery. It is used by compilers so as to be able to catch many
errors in each run, and it mattered most when compiling was done in batch
mode rather than interactively. Many syntactic error-recovery
techniques are based on a formal model of finite-state generation of
errors, such as a missing or extra symbol, or local garbling of the string
(this is also true of natural-language processing). This, too, can be
modeled by a GSM. Hence, again ignoring semantic aspects, the larger
language of programs with syntactic errors accepted by a parser with
error correction is also context-free.
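A rough sketch of those single-symbol error hypotheses (names and token shapes are my own illustrative choices): at a failure point, the recovery tries the "extra symbol" edit first, then falls back to the "missing symbol" edit:

```python
# Finite-state error hypotheses: an extra symbol (delete it) or a
# missing symbol (insert one), each a bounded local edit of the input.
def repair(tokens, pos, expected):
    """Return (hypothesis, resumption position) for a failure at pos."""
    if pos + 1 < len(tokens) and tokens[pos + 1] in expected:
        return f"extra {tokens[pos]!r}: delete it", pos + 1
    return f"missing one of {sorted(expected)}: insert it", pos

print(repair(["(", "x", "x", ")"], 2, {")"}))
# ("extra 'x': delete it", 3)
print(repair(["(", "x"], 2, {")"}))
# ("missing one of [')']: insert it", 2)
```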
There is more in the comments, but how long should an answer be?