Extracting and Mining PDF Data

Question

I have a PDF file (an admission application). I want to read/search the PDF, extract terms with similar meaning, and then convert this data into a DataFrame to save as an .xlsm file. HELP!

score 4 · Answer 1 · answered Jan 09 '20 at 00:03

In my opinion, you have 4 possibilities:

You can process the PDF directly using tabula.
You can convert the PDF to text using pdftotext, then parse the text with Python.
You can use an external tool to convert your PDF file to Excel or CSV, then use the appropriate Python module to open the Excel/CSV file.
You can also convert the PDF to an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data.
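To make the second possibility more concrete, here is a minimal sketch, not a definitive implementation. It assumes the pdftotext command-line tool (from Poppler) is installed and that the form text comes out as "Label: value" lines, which may not hold for your PDF. The SYNONYMS table, the column names and the helper functions are all hypothetical; they only illustrate how labels with similar meaning could be mapped to one column.

```python
import csv
import io
import re
import subprocess

# Hypothetical synonym table: label variants that may appear in the form,
# each mapped to one canonical column name. Adjust to your own document.
SYNONYMS = {
    "name": "name",
    "full name": "name",
    "applicant name": "name",
    "dob": "date_of_birth",
    "date of birth": "date_of_birth",
    "birth date": "date_of_birth",
    "email": "email",
    "e-mail": "email",
}

# Assumption: each field is on its own line in the form "Label: value".
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return the extracted text.

    "-layout" keeps the original physical layout; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_fields(text):
    """Collect known fields from the text, merging labels with similar meaning."""
    record = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        label, value = match.group(1).lower(), match.group(2)
        column = SYNONYMS.get(label)
        if column is not None:
            # Keep the first value seen for each canonical column.
            record.setdefault(column, value)
    return record


def records_to_csv(records):
    """Serialize a list of parsed records to CSV text with sorted column names."""
    columns = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

For one file, usage would look like `record = parse_fields(pdf_to_text("application.pdf"))`, where the file name is a placeholder. If you want a DataFrame instead of CSV, a list of such records can be passed directly to `pandas.DataFrame(records)`.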

This answer comes from:

https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python/53050405

Your question is very similar to:

Regards
