
I just started reading "Parsing Techniques, A Practical Guide", Second Edition, by Dick Grune and Ceriel J.H. Jacobs.

On page 12, the authors start describing a set of rules that can be used to generate all enumerations of names of the type "tom, dick and harry". The rules allow single names (e.g. "tom") as well as repetitions ("tom, harry, dick, dick and harry"). Multiple names in an enumeration are separated by commas, except the last two names, which are separated by "and". So the following aren't valid: "tom, harry, dick" and "harry and tom and dick".

A few pages later, after having defined some more terms and formalisms, the authors come up with the following replacement rules (a phrase structure grammar) to generate the sentences of the desired type:

0. Name -> tom | dick | harry
1. Sentence -> Name | List End
2. List -> Name | Name, List
3. , Name End -> and Name

In the above, Sentence is the start symbol.

However, it seems to me that these rules can generate incorrect sentences: if we replace Sentence by List End and then List by Name, we end up with Name End, for which no replacement rule is defined.

It seems to me that the 2nd line in the rules above, if replaced by

Sentence -> Name | Name, List End

would fix this problem.

Am I correct that the authors have made an oversight, and is my modification correct? Or have I misunderstood something?

I don't have a CS background and this is the first time I'm reading about parsing, so please keep that in consideration in your replies. Thanks!

Raphael
Aky

1 Answer


You are correct, in a sense. Speaking in formal notation, you have

$\qquad \displaystyle \mathtt{Sentence} \Rightarrow \mathtt{List}\ \mathtt{End} \Rightarrow \mathtt{Name}\ \mathtt{End}$

and we cannot get rid of $\mathtt{End}$. We have not generated a wrong word, but a sentential form that cannot be derived further to any word (a dead end). That is not a mistake in the grammar itself (with respect to the generated language), but it is certainly a nuisance when you think of parsing.

Your modification is valid. In particular, we do not need to generate lists of length one using rule 2 (which you make impossible by your modification) because we can just use alternative 1 of rule 1.
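To make the dead end concrete, here is a minimal Python sketch that mechanically applies the replacement rules to `Sentence` and collects both the finished forms and the forms that get stuck. It rests on several assumptions of mine: rule 0 is left out, so `Name` stands in for any concrete name; the names `ORIGINAL`, `MODIFIED`, `successors` and `explore` are made up for illustration; and the search is cut off at a maximum length so that it terminates.

```python
from collections import deque

# Each rule is (left-hand side, right-hand side), both tuples of symbols.
# Rule 0 (Name -> tom | dick | harry) is deliberately omitted (assumption):
# "Name" is treated as a placeholder for any concrete name.
ORIGINAL = [
    (("Sentence",), ("Name",)),
    (("Sentence",), ("List", "End")),
    (("List",), ("Name",)),
    (("List",), ("Name", ",", "List")),
    ((",", "Name", "End"), ("and", "Name")),
]
# The asker's proposal: only the second Sentence alternative changes.
MODIFIED = [
    (("Sentence",), ("Name",)),
    (("Sentence",), ("Name", ",", "List", "End")),
] + ORIGINAL[2:]

UNFINISHED = {"Sentence", "List", "End"}  # symbols that must still be rewritten


def successors(form, rules):
    """Yield every form reachable from `form` by one rule application."""
    for lhs, rhs in rules:
        n = len(lhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == lhs:
                yield form[:i] + rhs + form[i + n:]


def explore(rules, max_len=6):
    """Breadth-first search over sentential forms of at most max_len symbols.

    Returns (finished, stuck): forms without unfinished symbols, and forms
    that still contain one but to which no rule applies.
    """
    start = ("Sentence",)
    seen, queue = {start}, deque([start])
    finished, stuck = set(), set()
    while queue:
        form = queue.popleft()
        nxt = list(successors(form, rules))
        if not UNFINISHED.intersection(form):
            finished.add(form)
        elif not nxt:
            stuck.add(form)
        for f in nxt:
            if len(f) <= max_len and f not in seen:
                seen.add(f)
                queue.append(f)
    return finished, stuck


for label, rules in (("original", ORIGINAL), ("modified", MODIFIED)):
    finished, stuck = explore(rules)
    print(label, "stuck forms:", sorted(stuck))
```

Within the explored length bound, both rule sets reach the same finished forms; only the original one also reaches the stuck form `Name End`.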

Note that the grammar given is not the greatest (for parsing). For example, it is not context-free (rule 3). A better way to implement detection of the end is to generate the list the other way round ($\_$ denotes a space):

$\qquad \displaystyle \begin{align} \mathsf{Sentence} &\to \mathsf{Name} \mid \mathsf{List}\ \mathtt{and}\ \mathsf{Name} \\ \mathsf{List} &\to \mathsf{Name} \mid \mathsf{List}\ \mathtt{,\_}\ \mathsf{Name} \\ \mathsf{Name} &\to \mathtt{tom} \mid \mathtt{dick} \mid \mathtt{harry} \end{align}$

Of course, this grammar is now left-recursive, which is problematic for some parsing strategies (e.g. LL). So it may be useful to create the list from left to right after all, but keep generating the $\mathtt{and}$ at the start:

$\qquad \displaystyle \begin{align} \mathsf{Sentence} &\to \mathsf{Name} \mid \mathsf{List}\ \mathtt{and}\ \mathsf{Name} \\ \mathsf{List} &\to \mathsf{Name} \mid \mathsf{Name}\ \mathtt{,\_}\ \mathsf{List} \\ \mathsf{Name} &\to \mathtt{tom} \mid \mathtt{dick} \mid \mathtt{harry} \end{align}$

Now a parser working from left to right can always identify the next rule (with a lookahead of two to check whether a comma follows a name).
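As an illustration, the last grammar can be turned into a small hand-written recognizer, with one function per nonterminal. This is only a sketch under my own assumptions: the function names (`tokenize`, `parse_name`, `parse_list`, `accepts`) are invented, and the tokenizer simply drops whitespace instead of insisting on exactly one space after each comma as $\mathtt{,\_}$ would.

```python
import re

NAMES = {"tom", "dick", "harry"}


def tokenize(text):
    """Split into words and single non-space characters; whitespace is dropped
    (a simplification compared to the grammar's explicit space after a comma)."""
    return re.findall(r"[A-Za-z]+|\S", text)


def parse_name(tokens, pos):
    """Name -> tom | dick | harry. Return the position after the name, or None."""
    if pos < len(tokens) and tokens[pos] in NAMES:
        return pos + 1
    return None


def parse_list(tokens, pos):
    """List -> Name | Name , List. Return the position after the list, or None."""
    pos = parse_name(tokens, pos)
    if pos is None:
        return None
    # Look at the token after the name: a comma selects the second alternative.
    if pos < len(tokens) and tokens[pos] == ",":
        return parse_list(tokens, pos + 1)
    return pos


def accepts(text):
    """Sentence -> Name | List and Name."""
    tokens = tokenize(text)
    # If nothing follows the first token, only the alternative "Name" can apply.
    if len(tokens) <= 1:
        return parse_name(tokens, 0) == 1
    pos = parse_list(tokens, 0)
    if pos is None or pos >= len(tokens) or tokens[pos] != "and":
        return False
    # The final name must consume the rest of the input.
    return parse_name(tokens, pos + 1) == len(tokens)


for s in ["tom", "tom, dick and harry", "tom, harry, dick", "harry and tom and dick"]:
    print(repr(s), accepts(s))
```

The choice between alternatives is always made by peeking at the token after a name, so the recognizer never has to undo work it has already done.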

Raphael