I have a PDF file (an admission application). I want to read and search the PDF, extract terms with similar meanings, and then convert this data into a DataFrame to save as an .xlsm file. HELP!
1 Answer
In my opinion, you have 4 possibilities:
You can process the PDF directly using tabula.
You can convert the PDF to text using pdftotext, then parse the text with Python.
You can use an external tool to convert the PDF to Excel or CSV, then open the resulting file with the appropriate Python module.
You can also convert the PDF to an image file, then use any recent OCR software (which can reconstruct tables automatically from the picture) to get the data.
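As a rough illustration of the second possibility (pdftotext, then parsing with Python), here is a minimal sketch. It assumes the pdftotext command-line tool is installed and that the application form contains "Label: value" lines. The SYNONYMS table, the field names, and the helper functions are hypothetical and only show one way of grouping terms with similar meaning under a single column name.

```python
import re
import subprocess

# Hypothetical synonym table (an assumption, not from the question):
# maps label variants found in the form to one canonical column name.
SYNONYMS = {
    "name": "name",
    "full name": "name",
    "applicant name": "name",
    "dob": "date_of_birth",
    "date of birth": "date_of_birth",
    "birth date": "date_of_birth",
    "email": "email",
    "e-mail": "email",
}

# Matches lines shaped like "Label: value" (assumed layout of the form).
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return the extracted text.

    "-layout" keeps the physical layout; "-" sends the output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_fields(text):
    """Collect known fields from the text, merging synonymous labels."""
    record = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not a "Label: value" line
        label, value = match.groups()
        canonical = SYNONYMS.get(label.lower())
        if canonical:  # ignore labels that are not in the table
            record[canonical] = value
    return record
```

A call such as parse_fields(pdf_to_text("application.pdf")) (the file name is a placeholder) returns one dict per application. A list of such dicts can be passed to pandas.DataFrame and written out with DataFrame.to_excel; whether the target can be a macro-enabled .xlsm file depends on the Excel engine you have installed, so that part is worth checking separately.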
This answer comes from:
Your question is very similar to:
https://stackoverflow.com/questions/27927880/extracting-tables-from-a-pdf
https://stackoverflow.com/questions/17591426/extract-table-from-a-pdf
https://stackoverflow.com/questions/25125178/how-to-scrape-tables-in-thousands-of-pdf-files
https://stackoverflow.com/questions/29868541/pdf-data-and-table-scraping-to-excel
Regards