
I wish to find a CFG for a language over two symbols (say a and b) whose words begin and end with the same symbol and have equal quantities of a's and b's. What is the thought process I should use for finding such a grammar? What is the most natural or simplest grammar for this language? I hope you'll explain your answer; hopefully this will suggest some patterns I should look for when trying to synthesise a grammar for a specified language.


Here's the best solution I could come up with on my own:

$ S \to aTbbTa \mid bTaaTb$

$ T \to abT \mid baT \mid aTb \mid bTa \mid \epsilon$

I think this grammar is correct. Informal argument: I can see that if the word is $awa$ ($w$ being a substring), then there are at least two $b$'s in $w$ that are adjacent. This suggests the form $aTbbTa$ in the first production rule. (The argument also holds with the roles of the two symbols reversed.) The second set of production rules is meant to generate every possible word of the language while keeping the number of $a$'s and $b$'s equal. Symmetry suggests that the rule $abT$ should be accompanied by the rule $baT$, and $aTb$ by $bTa$. Initially I wondered whether any of the rules in the second set was redundant, but I don't think so: I can think of words that couldn't be formed if any of them were missing. Rather, I need to be sure there aren't any words from the language that my grammar can't generate.

[I guess I would need induction to prove my grammar generates every possible word in the language. But right now I'm more interested in the thought process behind coming up with a grammar, and as far as I know, induction (in general) doesn't help much in synthesising a solution/rule/formula/etc.; it principally serves to verify a purported solution.]
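In the meantime, here is a brute-force sanity check (not a proof) I put together: derive every word of the grammar up to some length bound and compare against the words of the language up to that bound. The grammar encoding and the `derive` helper are my own; nonterminals are uppercase letters.

```python
from itertools import product

# The grammar from the question; uppercase letters are nonterminals.
GRAMMAR = {
    "S": ["aTbbTa", "bTaaTb"],
    "T": ["abT", "baT", "aTb", "bTa", ""],
}
MAX = 10  # length bound for the exhaustive comparison

def derive(grammar, start, max_len):
    """BFS over sentential forms. Terminals never disappear, so any form
    whose terminal count already exceeds max_len can be pruned."""
    words, seen, frontier = set(), set(), {start}
    while frontier:
        nxt = set()
        for form in frontier:
            i = next((k for k, c in enumerate(form) if c.isupper()), None)
            if i is None:           # no nonterminals left: a finished word
                words.add(form)
                continue
            for rhs in grammar[form[i]]:
                new = form[:i] + rhs + form[i + 1:]
                if sum(c.islower() for c in new) <= max_len and new not in seen:
                    seen.add(new)
                    nxt.add(new)
        frontier = nxt
    return words

def in_language(w):
    """Nonempty, same first and last symbol, equal counts of a's and b's."""
    return w != "" and w[0] == w[-1] and w.count("a") == w.count("b")

generated = derive(GRAMMAR, "S", MAX)
target = {"".join(p) for n in range(1, MAX + 1)
          for p in product("ab", repeat=n) if in_language("".join(p))}

print("sound:", generated <= target)                       # every derived word is in L
print("missing:", sorted(target - generated, key=len)[:5]) # words the grammar can't reach
```

Running this confirms soundness (every derived word is in the language), and any words listed as missing would show exactly where the grammar falls short.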

A.K.
1 Answer


Here is the thought process I would use. I would notice that your language $L$ can be written as $L= L_1 \cap L_2$, where $L_1$ is the set of words that begin and end with the same symbol, and $L_2$ is the set of words that have equal quantities of a's and b's.

Then, I would note that $L_1$ is regular and so can be expressed by a simple DFA (with 5 states). Also, I would note that $L_2$ is context-free and can be expressed by a simple CFG: e.g., $S \to \varepsilon \mid aSb \mid bSa \mid SS$.
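To make the 5-state DFA for $L_1$ concrete, here is one way to write it down (the state names are my own choice): a start state, plus four states recording the pair (first symbol seen, last symbol seen), accepting when the two agree.

```python
# DFA for L1 = words that begin and end with the same symbol.
# States: "start", plus pairs (first symbol, last symbol).
DELTA = {
    ("start", "a"): ("a", "a"), ("start", "b"): ("b", "b"),
    (("a", "a"), "a"): ("a", "a"), (("a", "a"), "b"): ("a", "b"),
    (("a", "b"), "a"): ("a", "a"), (("a", "b"), "b"): ("a", "b"),
    (("b", "a"), "a"): ("b", "a"), (("b", "a"), "b"): ("b", "b"),
    (("b", "b"), "a"): ("b", "a"), (("b", "b"), "b"): ("b", "b"),
}
ACCEPT = {("a", "a"), ("b", "b")}

def in_L1(word):
    state = "start"
    for c in word:
        state = DELTA[(state, c)]
    return state in ACCEPT

print(in_L1("abba"), in_L1("abab"))   # True False
```

Note that this DFA rejects the empty word; whether $\varepsilon$ "begins and ends with the same symbol" is a matter of convention, and the construction works either way.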

Finally, I would recall the standard closure property: the intersection of a regular language and a context-free language is context-free. It follows that $L = L_1 \cap L_2$ is context-free. The proof of this closure property is constructive: it shows how to build a CFG for $L_1 \cap L_2$, given a DFA for $L_1$ and a CFG for $L_2$. This gives a CFG for your language. The resulting CFG will have $5^2=25$ non-terminals; some of them are unreachable and can be pruned, but the result is still quite messy.
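The construction (often called the triple, or Bar-Hillel, construction) can be sketched mechanically. The following is my own illustrative code, with made-up state names: each non-terminal of the product grammar is a triple $(p, A, q)$ meaning "a word that $A$ generates and that drives the DFA from state $p$ to state $q$".

```python
from itertools import product

CFG = {"S": ["", "aSb", "bSa", "SS"]}      # L2: equal numbers of a's and b's
STATES = ["q0", "aa", "ab", "ba", "bb"]    # q0 = start; "xy" = first x, last y
DELTA = {("q0", "a"): "aa", ("q0", "b"): "bb",
         ("aa", "a"): "aa", ("aa", "b"): "ab",
         ("ab", "a"): "aa", ("ab", "b"): "ab",
         ("ba", "a"): "ba", ("ba", "b"): "bb",
         ("bb", "a"): "ba", ("bb", "b"): "bb"}
ACCEPT = {"aa", "bb"}

productions = {}
for (A, rhss), p, q in product(CFG.items(), STATES, STATES):
    for rhs in rhss:
        n = len(rhs)
        # Guess the intermediate DFA states between the symbols of rhs.
        for mids in product(STATES, repeat=max(n - 1, 0)):
            path = (p,) + mids + (q,)
            body, ok = [], True
            for i, X in enumerate(rhs):
                if X.isupper():                         # nonterminal: becomes a triple
                    body.append((path[i], X, path[i + 1]))
                elif DELTA.get((path[i], X)) == path[i + 1]:
                    body.append(X)                      # terminal: DFA move must match
                else:
                    ok = False
                    break
            if ok and not (n == 0 and p != q):          # A -> eps only when p == q
                productions.setdefault((p, A, q), []).append(body)

print(len(productions), "nonterminals with productions")  # 25, as claimed
```

The start symbols of the product grammar are the triples $(q_0, S, f)$ with $f$ accepting; pruning the non-terminals unreachable from those is the cleanup step mentioned above, and even after pruning the grammar stays messy.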

The resulting CFG isn't the simplest or smallest CFG for your language. Whether it is the most natural is open for debate. But since you asked for the thought process behind coming up with a grammar, this illustrates one general technique for constructing CFG's: separate out the part that requires something more than finite state (equal quantities of a's and b's) from the part that can be handled with a finite-state automaton.

D.W.