Has anyone seen the following string classifier discussed?

Question

The closest related question I have found is "Find string patterns preferably in regex for string streams", but it has no answer and is also a little less constrained than my idea.

Given a set of strings such as:

Foo 25 bar zoo animals
Foo 50 bar zoo animals
Foo 1 bar and boo decorative plants
Foo animals
Foo zoo animals

the classifier should produce a hierarchy (tree) of generalized and specialized patterns, with the input strings as the leaves of the tree.

Note that in this idea the smallest sub-string unit would be words, not characters. I suppose doing it based on characters is an equivalent problem, so I'm not insisting that it be done based on words. As a fundamental data structure to conceptualize this problem, we can just say a "sequence of tokens": in my case the tokens are words, but if you want to do it based on strings and characters, the structure is still a sequence of tokens, where the tokens are the characters.

Given the example input above, the output would be a hierarchy like this:

Foo *
- Foo * bar *
  - Foo * bar zoo animals
    - Foo 25 bar zoo animals
    - Foo 50 bar zoo animals
  - Foo 1 bar and boo decorative plants
- Foo * animals
  - Foo animals
  - Foo * zoo animals
    - Foo zoo animals
    - Foo * bar zoo animals
      - Foo 25 bar zoo animals
      - Foo 50 bar zoo animals

The asterisk means that zero or more tokens may appear here.
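
To make the meaning of the asterisk concrete, here is a minimal Python sketch of a matcher over whitespace-separated tokens. The function name `matches` and the choice of Python are assumptions made for illustration; nothing here is taken from an existing library.

```python
from functools import lru_cache


def matches(pattern: str, text: str) -> bool:
    """Return True if the tokens of `text` fit `pattern`.

    Tokens are whitespace-separated words; a '*' token in the
    pattern stands for zero or more arbitrary tokens.
    """
    p = tuple(pattern.split())
    t = tuple(text.split())

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: succeed only if the text is exhausted too.
            return j == len(t)
        if p[i] == "*":
            # Either '*' matches nothing, or it swallows one token and stays active.
            return go(i + 1, j) or (j < len(t) and go(i, j + 1))
        # A literal token must match the current text token exactly.
        return j < len(t) and p[i] == t[j] and go(i + 1, j + 1)

    return go(0, 0)


print(matches("Foo * animals", "Foo animals"))             # True ('*' matches zero tokens)
print(matches("Foo * animals", "Foo 25 bar zoo animals"))  # True
print(matches("Foo * bar *", "Foo zoo animals"))           # False (no 'bar' token)
```

The memoized recursion keeps the check polynomial even when a pattern contains several asterisks.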

I'm trying to create my own solution here, but I suspect somebody must have come across this before. Perhaps while trying to reconstruct general regex patterns from a set of strings?

There is literature about grammar inference, such as in this StackOverflow question https://stackoverflow.com/questions/15512918/grammatical-inference-of-regular-expressions-for-given-finite-list-of-representa but as I dig into these algorithms they appear to treat the problem much more generally than I care about. For example, I do not care about repeating substrings.

It looks like if I determine the longest common sub-sequences between more than one leaf sequence, I could get somewhere. But I wanted to ask first, while I am trying to cobble my solution together.
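
For reference, a longest common subsequence of two token sequences can be computed with the textbook dynamic-programming table. The sketch below is only an illustration of that idea for two sequences; the function name `lcs` is an assumption, and ties between equally long subsequences are broken arbitrarily.

```python
def lcs(x: list[str], y: list[str]) -> list[str]:
    """Return one longest common subsequence of two token lists."""
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of x[i:] and y[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if x[i] == y[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover the tokens themselves.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if x[i] == y[j]:
            out.append(x[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


a = "Foo 25 bar zoo animals".split()
b = "Foo zoo animals".split()
print(lcs(a, b))  # ['Foo', 'zoo', 'animals']
```

Wherever tokens of either input are skipped between two kept tokens, a pattern would carry a `*`, so this pair would suggest `Foo * zoo animals`.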

UPDATE: I actually managed to program the whole thing yesterday. The solution I have is pretty straightforward, almost trivial.

1. identify patterns by pairwise comparison
2. recurse to compare patterns of patterns, until nothing new is added
3. for each pattern, collect all the strings and patterns that generated/matched it (not just pairs now)
4. build a tree by substituting each pattern under a pattern with its group of matching strings and sub-patterns
5. remove from higher-level patterns all strings that are matched by their lower-level patterns
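
The steps above can be sketched in Python roughly as follows. This is not the code referred to in the post, only one possible reading of it, under these assumptions: the pairwise comparison is done with a longest-common-subsequence alignment, "matching" is checked by translating a pattern into a regular expression, and a `*` inside a sub-pattern is treated as an ordinary token when it is checked against a more general pattern. All function names (`generalize`, `find_patterns`, `covers`, `children`, `render`) are hypothetical.

```python
import re
from itertools import combinations


def generalize(a: str, b: str) -> str:
    """Keep the tokens of one LCS of a and b; put a single '*' in every gap."""
    x, y = a.split(), b.split()
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of x[i:] and y[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if x[i] == y[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if x[i] == y[j]:
            tok, i, j = x[i], i + 1, j + 1
        else:
            tok = "*"
            if table[i + 1][j] >= table[i][j + 1]:
                i += 1
            else:
                j += 1
        if tok != "*" or out[-1:] != ["*"]:  # never emit two '*' in a row
            out.append(tok)
    if (i < n or j < m) and out[-1:] != ["*"]:
        out.append("*")  # leftover tail on either side
    return " ".join(out)


def find_patterns(strings):
    """Compare all pairs, then patterns of patterns, until nothing new appears."""
    items = set(strings)
    while True:
        new = {generalize(a, b) for a, b in combinations(sorted(items), 2)} - items
        if not new:
            return items - set(strings)
        items |= new


def covers(pattern: str, item: str) -> bool:
    """True if item (a string or a sub-pattern) fits pattern; '*' = zero or more tokens."""
    rx = "".join(r"(?: \S+)*" if t == "*" else " " + re.escape(t) for t in pattern.split())
    return re.fullmatch(rx, "".join(" " + t for t in item.split())) is not None


def children(node, items):
    """Everything node covers, minus what a covered sub-pattern already covers."""
    below = [x for x in items if x != node and covers(node, x)]
    return [x for x in below if not any(p != x and covers(p, x) for p in below)]


def render(node, items, depth=0):
    """Return the subtree under node as indented '- ' lines."""
    lines = [("  " * (depth - 1) + "- " if depth else "") + node]
    for child in sorted(children(node, items)):
        lines += render(child, items, depth + 1)
    return lines


strings = [
    "Foo 25 bar zoo animals",
    "Foo 50 bar zoo animals",
    "Foo 1 bar and boo decorative plants",
    "Foo animals",
    "Foo zoo animals",
]
items = sorted(set(strings) | find_patterns(strings))
# Roots are the items that no other item covers.
roots = [p for p in items if not any(q != p and covers(q, p) for q in items)]
for root in roots:
    print("\n".join(render(root, items)))
```

Children are printed in plain sorted order here, so siblings may appear in a different order than in a hand-drawn tree. The sketch also relies on tokens rarely repeating within a string; with repeated tokens the LCS alignment is no longer unique and the patterns found may depend on tie-breaking.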

And that's it. The pairwise comparison, replacing non-matching sub-sequences with *, is pretty simple. Perhaps in my application I have a constraint that makes this simple: it is highly unlikely that the same word occurs more than once in a string. This is why discussions of regular expression inference, which always start with "abba", are harder than I need them to be.

Here is the result of my pattern tree generator using the exact strings in my initial question above:

Foo *
- Foo * animals
  - Foo animals
  - Foo * zoo animals
    - Foo * bar zoo animals
      - Foo 25 bar zoo animals
      - Foo 50 bar zoo animals
    - Foo zoo animals
- Foo * bar *
  - Foo * bar zoo animals
    - Foo 25 bar zoo animals
    - Foo 50 bar zoo animals
  - Foo 1 bar and boo decorative plants

It's similar to my original manually created tree.
