
From my reading it seems that most grammars are concerned with generating an infinite number of strings. What if you worked the other way around?

Given n strings of length m, it should be possible to make a grammar that generates those strings, and only those strings.

Is there a known method for doing this? Ideally a technique name I can research. Alternatively, how would I go about doing a literature search to find such a method?

Raphael
Gustav Bertram

5 Answers


This falls within the general topic of "grammar induction"; searching on that phrase will turn up tons of literature. See, e.g., Inducing a context free grammar, https://en.wikipedia.org/wiki/Grammar_induction, https://cstheory.stackexchange.com/q/27347/5038.

For regular languages (rather than context-free ones), see also Is regex golf NP-Complete?, Smallest DFA that accepts given strings and rejects other given strings, Are there improvements on Dana Angluin's algorithm for learning regular sets, and https://cstheory.stackexchange.com/q/1854/5038.

D.W.

If the set of strings is finite, say $S=\{s_1, s_2, \dots, s_m\}$, you can always come up with a context-free grammar that generates exactly those strings: let $A$ be a nonterminal; then the single rule $A \to s_1 \mid s_2 \mid \dots \mid s_m$ suffices. For a finite set of strings you can even come up with a finite-state automaton that accepts only those strings. So the case of a finite set of strings is really trivial.
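
For concreteness, here is a tiny Python sketch of that idea (my own illustration, not part of the answer; the function name and the quoting of terminals are ad-hoc choices) which spells out the single rule for a given finite set of strings:

```python
def finite_language_grammar(strings):
    # One nonterminal A with one alternative per string: A -> s_1 | s_2 | ... | s_m
    alternatives = " | ".join(f'"{s}"' for s in sorted(strings))
    return f"A -> {alternatives}"

print(finite_language_grammar({"ab", "aab", "ba"}))
# A -> "aab" | "ab" | "ba"
```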

advocateofnone

There are lots of ways, so you need to impose additional criteria on the quality of the results. Examples:

  1. List: For each string $w$ in the language, have a rule $S \rightarrow w$. Let $S$ be the starting nonterminal. Done.
  2. Prefix tree: For each prefix $w$ of a string in the language, have the nonterminal $X_w$. For each string $w_1xw_2$ in the language, where $x$ is a symbol, have the rule $X_{w_1} \rightarrow xX_{w_1x}$. For each string $w$ in the language, have the rule $X_w \rightarrow \epsilon$. Let $X_\epsilon$ be the starting nonterminal. Done. (A small sketch of this construction follows the list.)
  3. Suffix tree: the same, reversed.
  4. Minimal grammar: apply an algorithm guaranteed to produce a grammar of minimal size, e.g. one with the minimal number of rules. I don't know how hard this is.
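
As an illustration of construction 2, here is a small Python sketch (my own, hypothetical code; the `X[...]` spelling of the nonterminals $X_w$ is just an ad-hoc textual notation) that emits the prefix-tree grammar for a finite set of strings:

```python
def prefix_tree_grammar(strings):
    # One nonterminal X[w] per prefix w of some string; X[] plays the role of
    # the starting nonterminal X_epsilon.
    prefixes = {s[:i] for s in strings for i in range(len(s) + 1)}
    rules = []
    for w in sorted(prefixes):
        # A rule X[w] -> x X[wx] for every symbol x that extends the prefix w.
        for x in sorted({p[len(w)] for p in prefixes if p.startswith(w) and len(p) > len(w)}):
            rules.append(f"X[{w}] -> {x} X[{w}{x}]")
    for s in sorted(strings):
        rules.append(f"X[{s}] -> eps")  # accept exactly the given strings
    return rules

for rule in prefix_tree_grammar({"ab", "aab"}):
    print(rule)
```
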
reinierpost

What you are asking for is akin to a search index. Indeed, finite-state transducers can be created and used to recognize the text fed to them. For example, Lucene uses this algorithm: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

For a practical use, check out this blog post by Andrew Gallant: Index 1,600,000,000 Keys with Automata and Rust

In the post he describes a method to construct an FSA from a corpus of text such that it recognizes all of the words. The end result is an approximately minimal FST built from pre-sorted keys, in linear time and in constant memory.

[Figure: an FSA sharing prefixes and suffixes]

The implementation is available in his fst library: https://github.com/BurntSushi/fst
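
As a rough illustration of the underlying idea only (this is not the fst library's streaming algorithm, and all names are my own), the following Python sketch builds a trie over the strings and then merges states with identical right languages, which yields a minimal acyclic DFA sharing both prefixes and suffixes:

```python
def minimal_dfa(strings):
    # Build a trie; each node has a 'final' flag and a dict of outgoing edges.
    trie = {"final": False, "edges": {}}
    for s in strings:
        node = trie
        for ch in s:
            node = node["edges"].setdefault(ch, {"final": False, "edges": {}})
        node["final"] = True

    # Merge trie nodes with identical right languages, bottom-up, via signatures.
    registry = {}   # signature -> canonical state id
    states = []     # canonical state id -> (is_final, {symbol: target state id})

    def canonical(node):
        sig = (node["final"],
               tuple(sorted((ch, canonical(child)) for ch, child in node["edges"].items())))
        if sig not in registry:
            registry[sig] = len(states)
            states.append((node["final"], dict(sig[1])))
        return registry[sig]

    return canonical(trie), states

def accepts(dfa, word):
    state, states = dfa
    for ch in word:
        if ch not in states[state][1]:
            return False
        state = states[state][1][ch]
    return states[state][0]

dfa = minimal_dfa(["aab", "ab", "bb"])
print([w for w in ["aab", "ab", "bb", "b", "abb"] if accepts(dfa, w)])  # ['aab', 'ab', 'bb']
```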

lkraider

An answer to the question posed by reinierpost in point 4 of his answer, which also answers the original question:

We construct the dictionary automaton as follows:

  1. Construct an automaton that reads and accepts exactly the first string.
  2. For the next string, read it with the automaton until, for some letter, there is no transition; start a new branch for the rest of the string. Repeat until all strings are processed.

The maximal size of the automaton is the total length of the input strings. Assuming that you can simulate transitions and create new ones in constant time, the runtime is also bounded by the total length of the input strings; there is no difference between best and worst case.
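
A minimal Python sketch of this incremental construction (my own encoding: states are integers, transitions are keyed by (state, letter) pairs):

```python
def dictionary_automaton(strings):
    transitions = {}   # (state, letter) -> state
    final = set()      # accepting states
    next_state = 1     # state 0 is the start state
    for s in strings:
        state = 0
        for ch in s:
            if (state, ch) in transitions:
                state = transitions[(state, ch)]        # follow existing transitions...
            else:
                transitions[(state, ch)] = next_state   # ...and branch off at the first gap
                state = next_state
                next_state += 1
        final.add(state)   # accept exactly at the end of each input string
    return transitions, final

transitions, final = dictionary_automaton(["car", "cat", "do"])
print(len(transitions), len(final))   # 6 transitions, 3 accepting states (total input length 8)
```

Every new transition consumes at least one letter of the input, which is where the size bound above comes from.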

This automaton is minimal. Since, in the regular case, automata and grammars correspond almost one to one, the same is true for the grammar. Of course, it is impossible to construct something of size $n$ in less than $n$ time.

Peter Leupold