The closest related question I have found is Find string patterns preferably in regex for string streams, but it has no answer and is also a little less constrained than my idea.
Given a set of strings, for example:
- Foo 25 bar zoo animals
- Foo 50 bar zoo animals
- Foo 1 bar and boo decorative plants
- Foo animals
- Foo zoo animals
the classifier should produce a hierarchy (tree) of generalized and specialized patterns and the input strings as the leaves of the tree.
Note in this idea the smallest sub-string unit would be words, not characters. I suppose doing it based on characters is an equivalent problem, so I'm not insisting that it be done based on words. But as a fundamental data structure to conceptualize this problem, we can just say a "sequence of tokens" where in my case the tokens would be words, but if you want to do it based on strings and characters, the structure is also a sequence of tokens where the tokens are the characters.
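To make the "sequence of tokens" idea concrete, here is a minimal sketch. The function name `tokenize` and its `unit` parameter are made up for illustration; the only assumption is that words are separated by whitespace.

```python
def tokenize(text, unit="word"):
    """Turn a string into a tuple of tokens: whitespace-separated words or single characters."""
    if unit == "word":
        return tuple(text.split())
    if unit == "char":
        return tuple(text)
    raise ValueError(f"unknown unit: {unit!r}")


print(tokenize("Foo 25 bar zoo animals"))
print(tokenize("Foo", unit="char"))
```

Everything downstream can then work on tuples of tokens without caring whether a token is a word or a character.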
Given the example input above, the output would be a hierarchy like this:
- Foo *
- Foo * bar *
- Foo * bar zoo animals
- Foo 25 bar zoo animals
- Foo 50 bar zoo animals
- Foo 1 bar and boo decorative plants
- Foo * bar zoo animals
- Foo * animals
- Foo animals
- Foo * zoo animals
- Foo zoo animals
- Foo * bar zoo animals
- Foo 25 bar zoo animals
- Foo 50 bar zoo animals
- Foo * bar *
The asterisk means that zero or more tokens may appear here.
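The wildcard semantics can be pinned down with a small matcher. This is only a sketch under the stated rule (an asterisk stands for zero or more tokens); the names `matches` and `WILDCARD` are hypothetical, and both patterns and strings are assumed to be already split into word tokens.

```python
from functools import lru_cache

WILDCARD = "*"


def matches(pattern, tokens):
    """True if the token sequence fits the pattern, where '*' covers zero or more tokens."""
    pattern, tokens = tuple(pattern), tuple(tokens)

    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(pattern):
            return j == len(tokens)
        if pattern[i] == WILDCARD:
            # The wildcard either covers nothing, or swallows one token and stays active.
            return go(i + 1, j) or (j < len(tokens) and go(i, j + 1))
        return j < len(tokens) and pattern[i] == tokens[j] and go(i + 1, j + 1)

    return go(0, 0)


print(matches("Foo * animals".split(), "Foo animals".split()))
print(matches("Foo * bar *".split(), "Foo zoo animals".split()))
```

Because the asterisk may match nothing, "Foo * animals" covers "Foo animals" as well as "Foo zoo animals".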
I'm trying to create my own solution here, but I suspect somebody must have come across this before. Perhaps as reconstructing general regex patterns from a set of strings?
There is literature about grammar inference, such as in this StackOverflow question https://stackoverflow.com/questions/15512918/grammatical-inference-of-regular-expressions-for-given-finite-list-of-representa but as I dig into these algorithms they appear to treat the problem much more generally than I care about. For example, I do not care about repeating substrings.
It looks like I could get somewhere if I determine the longest common sub-sequences between more than one leaf sequence. But I wanted to ask first, while I am trying to cobble my solution together.
UPDATE: I actually managed to program the whole thing yesterday. The solution I have is pretty straightforward, almost trivial.
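For reference, the longest common sub-sequence of two token sequences can be computed with the textbook dynamic program. This is a generic sketch, not tied to any particular library; the name `lcs` is just illustrative, and it handles two sequences at a time.

```python
def lcs(a, b):
    """Return one longest common subsequence of two token sequences."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover the tokens themselves.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


print(lcs("Foo 25 bar zoo animals".split(), "Foo zoo animals".split()))
```

The common tokens are the fixed parts of a candidate pattern; the gaps between them are where the asterisks would go.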
- identify patterns by pairwise comparison
- recurse to compare patterns of patterns, until nothing new is added
- for each pattern, collect all its strings and patterns that generated/matched it (not just pairs now)
- build a tree by substituting the pattern under a pattern with its group of matching strings and sub-patterns
- remove from each higher-level pattern all strings that are already matched by its lower-level patterns
And that's it. The pairwise comparison, which replaces non-matching sub-sequences with *, is pretty simple. Perhaps my application has a constraint that makes this simple: it is highly unlikely that the same word occurs more than once in a string. This is why discussions of regular expression inference, which always seem to start with "abba", are harder than I need them to be.
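The first two steps (pairwise comparison, then repeating on patterns of patterns until nothing new appears) can be sketched roughly as follows. This is not the code behind the result shown here; it is a reconstruction under assumptions. In particular it uses Python's `difflib.SequenceMatcher` to align two token sequences, which is a swapped-in alignment technique rather than a true longest common sub-sequence, and the names `generalize` and `closure` are hypothetical. It leans on the same constraint that words rarely repeat within a string.

```python
from difflib import SequenceMatcher
from itertools import combinations

WILDCARD = "*"


def generalize(a, b):
    """Keep the tokens two sequences share and put '*' wherever they differ."""
    out = []

    def push(token):
        # Collapse adjacent wildcards into a single one.
        if token == WILDCARD and out and out[-1] == WILDCARD:
            return
        out.append(token)

    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            for token in a[i1:i2]:
                push(token)
        else:  # 'replace', 'delete' or 'insert': a non-matching stretch
            push(WILDCARD)
    return tuple(out)


def closure(strings):
    """Compare all pairs, then patterns with patterns, until no new pattern shows up."""
    leaves = {tuple(s.split()) for s in strings}
    patterns = set()
    while True:
        items = leaves | patterns
        new = {generalize(a, b) for a, b in combinations(sorted(items), 2)} - items
        if not new:
            return patterns
        patterns |= new


strings = [
    "Foo 25 bar zoo animals",
    "Foo 50 bar zoo animals",
    "Foo 1 bar and boo decorative plants",
    "Foo animals",
    "Foo zoo animals",
]
for pattern in sorted(closure(strings)):
    print(" ".join(pattern))
```

On the five example strings this sketch produces the patterns "Foo *", "Foo * animals", "Foo * bar *", "Foo * bar zoo animals" and "Foo * zoo animals". Grouping strings under patterns and pruning the tree (the remaining steps) are left out.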
Here is the result of my pattern tree generator using the exact strings in my initial question above:
- Foo *
- Foo * animals
- Foo animals
- Foo * zoo animals
- Foo * bar zoo animals
- Foo 25 bar zoo animals
- Foo 50 bar zoo animals
- Foo zoo animals
- Foo * bar zoo animals
- Foo * bar *
- Foo * bar zoo animals
- Foo 25 bar zoo animals
- Foo 50 bar zoo animals
- Foo 1 bar and boo decorative plants
- Foo * bar zoo animals
- Foo * animals
It's similar to my original manually created tree.