I want to read text from a PDF document. I'm using iText 7:

var src = "<file location>";
var pdfDocument = new PdfDocument(new PdfReader(src));
var strategy = new LocationTextExtractionStrategy();
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    string text = PdfTextExtractor.GetTextFromPage(page, strategy);
    string processed = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
}
pdfDocument.Close();

The code runs, but the letters are not recognized. All of the extracted text looks like this:

"����������\n�������������������������\n�����������������������������������\n

The document is in English, so I didn't expect any encoding problems. What causes this, and how can I fix it?

dariaamir
  • You don't need the text conversion. Can you extract the text with Acrobat? If you can't, it's game over. – Paulo Soares Mar 20 '20 at 09:51
  • @PauloSoares What do you mean by 'with Acrobat'? – dariaamir Mar 20 '20 at 10:01
  • Adobe Acrobat Reader. Can you copy and paste the text after opening that PDF in Adobe Reader? Please share the PDF in question for analysis. – mkl Mar 20 '20 at 11:29
  • @dariaamir did you manage to solve this issue? – AdvanTiSS Feb 02 '21 at 22:07
  • @AdvanTiSS Unfortunately, no. The issue was caused by a custom proprietary font used in that document. I tried a few different libraries, but none of them was able to read it. – dariaamir Feb 04 '21 at 06:09

1 Answer

You don't need the conversion you're doing. Change the code to:

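// Assumes pdfDocument is the open PdfDocument from the question.
StringBuilder processed = new StringBuilder();

for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    // Use a fresh strategy for each page: a reused LocationTextExtractionStrategy
    // keeps the text chunks of earlier pages and returns them again.
    var strategy = new LocationTextExtractionStrategy();
    string text = PdfTextExtractor.GetTextFromPage(page, strategy);
    processed.Append(text);
}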
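
To illustrate why the conversion can be dropped: a .NET string is already Unicode, so pushing it through a byte encoding and back cannot repair anything; at best it is a no-op, at worst it loses characters. The sketch below is only an illustration under stated assumptions: `Encoding.ASCII` stands in for a narrow `Encoding.Default` (the real default depends on the platform), and `RoundTripDemo.RoundTrip` is a hypothetical helper name, not part of iText.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper that mirrors the conversion from the question,
    // with the intermediate encoding passed in explicitly.
    public static string RoundTrip(string text, Encoding intermediate)
    {
        // Characters the intermediate encoding cannot represent are lost here.
        byte[] raw = intermediate.GetBytes(text);
        // Re-encoding the bytes as UTF-8 cannot bring the lost characters back.
        byte[] utf8 = Encoding.Convert(intermediate, Encoding.UTF8, raw);
        return Encoding.UTF8.GetString(utf8);
    }
}
```

For example, `RoundTripDemo.RoundTrip("naïve", Encoding.ASCII)` returns `na?ve`, while plain ASCII input comes back unchanged.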
auburg
  • I get the same text with this code: ����������\n�������������������������\n – dariaamir Mar 20 '20 at 09:42
  • @dariaamir while the issue with your document is a different one, you should indeed drop that conversion thing. It looks like it will only do harm and never anything helpful. Or can you explain why you use it? – mkl Mar 20 '20 at 11:32
  • As others have commented, does the PDF file display the expected text when opened in Adobe Acrobat Reader? – auburg Mar 22 '20 at 11:17
  • @auburg Yes, I can see it in Acrobat Reader. It seems the issue is with a custom font that is used in this particular PDF. – dariaamir Mar 23 '20 at 09:24
  • See if this helps https://stackoverflow.com/questions/31746950/how-to-load-custom-font-in-fontfactory-register-in-itext – auburg Mar 23 '20 at 10:23
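
Since the thread traces the problem to the font rather than to the extraction code, a quick check on the extracted string can flag affected documents before any further processing. The following is a minimal sketch, assuming the unreadable characters really are U+FFFD replacement characters (the post displays them as �, but that is not confirmed). `ExtractionCheck.ReplacementRatio` is a hypothetical helper name and not part of iText, and any threshold applied to its result would be a judgment call.

```csharp
using System;

public static class ExtractionCheck
{
    // Hypothetical helper: returns the fraction of non-whitespace characters
    // in the extracted text that are U+FFFD (the Unicode replacement character).
    public static double ReplacementRatio(string text)
    {
        int visible = 0;
        int replaced = 0;
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c))
            {
                continue; // line breaks and spaces say nothing about the font
            }
            visible++;
            if (c == '\uFFFD')
            {
                replaced++;
            }
        }
        // Empty or whitespace-only input counts as "nothing replaced".
        return visible == 0 ? 0.0 : (double)replaced / visible;
    }
}
```

A ratio close to 1 for a page that shows readable text in a viewer would suggest that the text cannot be recovered by changing the extraction code alone.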