If I had to paraphrase current NER methodology, it generally finds patterns in strings and builds its own "vocabulary", so to speak.
Naturally, it performs like a charm with a mammoth dataset that has been carefully curated and hand-labeled with entities.
But what if the system were first given a dictionary of named entities for each category, and then shown sample text for that category (literature, or even simple tweets, for example), from which it "learns" how those named entities appear in context?
The difference from a regex-based approach is subtle: a regex system merely matches strings, so its usefulness scales only with the size of the dictionary and the number of rules.
But this system would actually learn, from a small training set, how "eating an apple" and "eating at Apple" can both be classified accurately: the subtle grammar that distinguishes a mention of the fruit from a mention of the company.
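To make the contrast concrete, here is a minimal sketch of the purely dictionary-driven approach in Python. The gazetteer entries are toy assumptions for illustration; the point is that a plain string matcher tags both "apple" tokens identically, with no way to use context.

```python
import re

# Toy gazetteer -- hypothetical entries for illustration only.
GAZETTEER = {"apple": "ORG", "google": "ORG", "paris": "LOC"}

def dictionary_tag(text):
    """Tag every dictionary match, blind to surrounding context."""
    return [(m.group(), GAZETTEER.get(m.group().lower(), "O"))
            for m in re.finditer(r"\w+", text)]

# Both occurrences of "apple" get the same tag:
print(dictionary_tag("eating an apple"))  # apple -> ORG
print(dictionary_tag("eating at Apple"))  # Apple -> ORG
```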
Could someone offer some intuition on how this might be implemented in CRF++, CRFsuite, Stanford NER, or any other toolkit?
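For what it's worth, my rough intuition is that in CRF-based taggers the usual trick is to turn the dictionary into a feature rather than a rule, alongside context features, so the model learns when a dictionary hit actually signals an entity. Below is a minimal sketch using sklearn-crfsuite (a Python wrapper around CRFsuite); the gazetteer and the two training sentences are toy assumptions, not real data.

```python
import sklearn_crfsuite

# Toy gazetteer -- hypothetical entries for illustration only.
GAZETTEER = {"apple", "google", "paris"}

def token_features(sent, i):
    """Features for token i: the word itself, a dictionary-lookup
    flag, and context words so the model can learn that "at Apple"
    behaves differently from "an apple"."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "in_gazetteer": word.lower() in GAZETTEER,
        "prev.word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Tiny toy training set in BIO tags (illustration only; a real set
# would be larger, though far smaller than a full rule system).
train = [
    (["I", "am", "eating", "an", "apple"], ["O", "O", "O", "O", "O"]),
    (["We", "are", "eating", "at", "Apple"], ["O", "O", "O", "O", "B-ORG"]),
]

X = [[token_features(s, i) for i in range(len(s))] for s, _ in train]
y = [tags for _, tags in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```

The key is the `in_gazetteer` feature: the CRF weighs it against the context features, so a dictionary hit in the wrong grammatical context need not be tagged as an entity. As far as I know, Stanford NER supports this directly through gazette files, and in CRF++ you can add a dictionary-flag column to the training data and reference it in the feature template.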
General disclaimer
I'm not a data scientist; this is just a passing thought from an ML enthusiast.