
I am looking for a method to parse semi-structured textual data, i.e. data that is poorly formatted but usually laid out visually as a matrix. The content and number of items can vary a lot, headers may or may not be present, and the data sometimes has to be interpreted column-wise and sometimes row-wise.

I have read the WHISK information extraction paper (https://homes.cs.washington.edu/~soderlan/soderland_ml99.pdf), but unfortunately it is not very detailed, and I have not been able to find a real system implementing it, or even snippets of code.

Does anybody have an idea where I can find such help? Or can anyone suggest an alternative approach that may suit my problem?

Thank you in advance for your reply!

mic

1 Answer


Without a sample of your data, it's unclear how it is structured and which tool would be suitable to process it.

Here are some blind recommendations based on my experience:

  • If you just need some flexibility when parsing the text records, such as a variable number of repeats of a certain field, or conditional parsing of fields, then you should check out this Python library: http://construct.readthedocs.org/en/latest/. It allows you to first define a hierarchical structure for your data, and then apply this structure to parse information from a text file. It's especially useful when parsing binary files.
  • If you're looking for an algorithm that can actually "understand" your text data and "infer" the structure in a smart way, then you might want to try a graph-based approach: http://kavita-ganesan.com/opinosis
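
To make the first suggestion more concrete, here is a minimal sketch of the "declare the structure, then apply it" idea in plain Python. It does not use Construct's API. The record layout (an optional header line without digits, then rows of the form `<name> <count> <v1> ... <v_count>`) and the names `parse_row` and `parse_block` are assumptions made up for illustration, since no sample data was given.

```python
import re


def parse_row(line):
    """Parse one row: a name, a repeat count, then that many integer values."""
    tokens = line.split()
    name, count = tokens[0], int(tokens[1])
    # Variable repeat: the count field says how many values follow.
    values = [int(tok) for tok in tokens[2:2 + count]]
    if len(values) != count:
        raise ValueError(f"expected {count} values in row: {line!r}")
    return {"name": name, "values": values}


def parse_block(text):
    """Parse a block of rows, with an optional header line (assumed layout)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = None
    # Conditional field: treat the first line as a header only if it
    # contains no digits (a heuristic chosen for this sketch).
    if lines and not re.search(r"\d", lines[0]):
        header = lines[0].split()
        lines = lines[1:]
    return {"header": header, "rows": [parse_row(ln) for ln in lines]}


sample = "name n values\nalpha 2 10 20\nbeta 3 1 2 3\n"
result = parse_block(sample)
print(result["header"])
print(result["rows"])
```

If your real data is closer to a fixed binary or byte-oriented layout, the same two ideas (a count field driving a repeated field, and a field that is parsed only under a condition) are what you would express declaratively in Construct instead of hand-writing them as above.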
imadcat