
In an LR(0) parser, each state consists of a collection of LR(0) items, which are productions annotated with a position. In an LR(1) parser, each state consists of a collection of LR(1) items, which are productions annotated with a position and a lookahead token.

It's known that, given a state in an LR(1) automaton, dropping the lookahead tokens from each LR(1) item yields a configurating set corresponding to some state in the LR(0) automaton. In that sense, the main difference between an LR(1) automaton and an LR(0) automaton is that the LR(1) automaton can contain several copies of each LR(0) state, each annotated with different lookahead information. For this reason, the LR(1) automaton for a given CFG is typically larger than the corresponding LR(0) automaton for that CFG.
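To make the projection concrete, here's a minimal sketch in Python; the class and function names are made up for illustration and aren't taken from any particular parser generator:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class LR0Item:
    lhs: str               # left-hand-side nonterminal
    rhs: Tuple[str, ...]   # right-hand-side symbols
    dot: int               # position of the dot within rhs

@dataclass(frozen=True)
class LR1Item:
    lhs: str
    rhs: Tuple[str, ...]
    dot: int
    lookahead: str         # a single lookahead terminal

def core(item: LR1Item) -> LR0Item:
    """Project an LR(1) item to its LR(0) core by dropping the lookahead."""
    return LR0Item(item.lhs, item.rhs, item.dot)

# Two LR(1) items that differ only in lookahead project to the same LR(0)
# item, which is why several distinct LR(1) states can share one LR(0) core.
i1 = LR1Item("T", ("a", "T"), 1, "x")
i2 = LR1Item("T", ("a", "T"), 1, "y")
assert core(i1) == core(i2)
```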

My question is how much larger the LR(1) automaton can be. If there are $n$ distinct terminal symbols in the alphabet of the grammar, then in principle we might need to replicate each state in the LR(0) automaton once per subset of those $n$ terminals, potentially leading to an LR(1) automaton that's $2^n$ times larger than the original LR(0) automaton. And since each state in the LR(0) automaton consists of a set of LR(0) items, each of which could pick up its own lookahead set, we might get an even larger blowup.

That said, I can't seem to find a way to construct a family of grammars for which the LR(1) automaton is significantly larger than the corresponding LR(0) automaton. Everything I've tried has led to a modest increase in size (usually around 2-4x), but I can't seem to find a pattern that leads to a large blowup.

Are there known families of context-free grammars whose LR(1) automata are exponentially larger than the corresponding LR(0) automata? Or is it known that in the worst case, you can't actually get an exponential blowup?

Thanks!

templatetypedef

2 Answers


The grammar

$$\begin{array}{ll} S \rightarrow T_0 & \\ T_n \rightarrow a \; T_{n+1} & (0 \le n < N) \\ T_n \rightarrow b \; T_{n+1} & (0 \le n < N) \\ T_n \rightarrow b \; T_{n+1} \; t_n & (0 \le n < N) \\ T_N \rightarrow t_N & \end{array}$$

has the LR(0) state containing the item $T_N \rightarrow t_N \;\cdot$ expanded into $2^N$ variants in the LR(1) automaton, since every subset of $\{t_0, \dots, t_{N-1}\}$ can occur as a lookahead set, each arising in a different context. The number of states in the LR(0) automaton, on the other hand, is linear in $N$. Thus an expansion factor on the order of $2^N/N$ is possible.
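For concreteness, here's a small Python sketch (using an ad-hoc tuple representation I made up) that enumerates the productions of this family for a given $N$, confirming that the grammar itself only grows linearly:

```python
def grammar(N):
    """Productions of the family above for a given N, as (lhs, rhs) tuples."""
    prods = [("S", ("T0",))]
    for n in range(N):                      # the T_n rules, for 0 <= n < N
        prods.append((f"T{n}", ("a", f"T{n+1}")))
        prods.append((f"T{n}", ("b", f"T{n+1}")))
        prods.append((f"T{n}", ("b", f"T{n+1}", f"t{n}")))
    prods.append((f"T{N}", (f"t{N}",)))     # the final T_N -> t_N rule
    return prods

# 3N + 2 productions: linear grammar size, versus 2^N LR(1) state variants.
assert len(grammar(5)) == 17
```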

Edit: I'll have to check later when I have more time, but I think adding $T_N \rightarrow T_0$ would give the exponential factor on nearly all of the LR(0) states. That does introduce a shift-reduce conflict, though.

AProgrammer

Such lower bounds are sometimes tricky to construct and can invoke deeper CS theory (e.g., in some cases, complexity-class separations). The paper below seems to give the kind of theoretical construction/lower bound you seek, e.g. in Theorem 5, which puts a lower bound on the total number of symbols and therefore also on the number of states. Its references include other similar constructions and lower bounds.

Theorem 5. Let $f(n,k) = 2^{\frac{1}{4}(n - k)} / n^2$. For any LR(k)-grammar with $k = 0, 1, \dots, n-1$ generating $L_n$, where $n \geq 3$, the number of nonterminal symbols must be at least $f(n,k)$, or there exists a nonterminal symbol $A$ such that the number of different productions with $A$ on the left-hand side must be at least $f(n,k)$.
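To see the growth this bound implies, one can evaluate $f$ directly (a trivial sketch):

```python
def f(n, k):
    """The lower bound from Theorem 5: f(n, k) = 2^((n - k)/4) / n^2."""
    return 2 ** ((n - k) / 4) / n ** 2

# For fixed k, the bound grows exponentially in n despite the n^2 denominator:
assert f(40, 0) > f(20, 0) > f(10, 0)
```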

On the size of parsers and LR(k)-grammars, by Leung and Wotschke

Jake
vzn