semi-structured text parsing using machine learning

Question

I am looking for a method to parse semi-structured textual data, i.e. data that is poorly formatted but usually laid out visually as a matrix. The content and the number of items vary a lot, headers may or may not be present, and the matrix sometimes has to be read column-wise and sometimes row-wise.

I have read the WHISK information extraction paper: https://homes.cs.washington.edu/~soderlan/soderland_ml99.pdf

Unfortunately, it is not very detailed, and I have not been able to find a real system implementing it, or even snippets of code.

Does anybody have an idea where I can find such help, or can anyone suggest an alternative approach that may be suited to my problem?

Thank you in advance for your reply!

score 5 · Answer 1 · answered Feb 11 '15 at 06:13

Without a sample of your data, it's unclear how it is structured and which tool would be suitable for processing it.

Here are some blind recommendations based on my experience:

If you just need some flexibility when parsing text records, such as a variable number of repeats of a certain field or conditional parsing of fields, then you should check out the Python library Construct (http://construct.readthedocs.org/en/latest/). It lets you first define a hierarchical structure for your data and then apply that structure to parse information from a file. It's especially useful for parsing binary files.
If you're looking for an algorithm that can actually "understand" your text data and "infer" its structure in a smart way, then you might want to try a graph-based approach: http://kavita-ganesan.com/opinosis
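
To make the first suggestion more concrete, here is a minimal sketch of the "declare the layout first, then apply it" idea using only the standard library. It does not use Construct's API. The record layout (`<name> <count> <item>... [note:<text>]`) and all function names are made up for illustration, and it assumes whitespace-separated tokens, so treat it as a starting point rather than a finished parser.

```python
# Hypothetical record layout: "<name> <count> <item> * count [note:<text>]"
# A field parser takes the remaining tokens plus the fields parsed so far
# and returns (value, remaining_tokens).

def take_str(tokens, ctx):
    return tokens[0], tokens[1:]

def take_int(tokens, ctx):
    return int(tokens[0]), tokens[1:]

def repeat(count_field, parser):
    """Variable repeat: apply `parser` as many times as ctx[count_field] says."""
    def parse(tokens, ctx):
        values = []
        for _ in range(ctx[count_field]):
            value, tokens = parser(tokens, ctx)
            values.append(value)
        return values, tokens
    return parse

def optional_prefixed(prefix):
    """Conditional field: consumed only if the next token starts with `prefix`."""
    def parse(tokens, ctx):
        if tokens and tokens[0].startswith(prefix):
            return tokens[0][len(prefix):], tokens[1:]
        return None, tokens
    return parse

# The declarative part: an ordered list of (field name, parser).
RECORD = [
    ("name", take_str),
    ("count", take_int),
    ("items", repeat("count", take_int)),
    ("note", optional_prefixed("note:")),
]

def parse_record(line, layout=RECORD):
    tokens = line.split()
    ctx = {}
    for field, parser in layout:
        ctx[field], tokens = parser(tokens, ctx)
    return ctx
```

For example, `parse_record("alpha 3 10 20 30 note:ok")` yields a dict with `items` equal to `[10, 20, 30]` and `note` equal to `"ok"`, while `parse_record("beta 0")` yields an empty `items` list and `note` set to `None`. Keeping the layout as data means a new record shape only needs a new list, not new parsing code.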
