+ 1

How to search for a list of text strings in a PDF using Python? PyPDF2 did not extract the text as expected; looking for other suggestions

I used PyPDF2, but it did not extract the text as expected when the PDF is rich in graphics and has a large number of pages. Are there other libraries or approaches that handle this better?

python3 pypdf2

4th Dec 2020, 5:57 AM

Gowtham rajasekher

2 Answers

+ 9

Hi Gowtham rajasekher, why don't you use textract? It supports many types of files, including PDFs.

http://textract.readthedocs.io/en/latest/
https://github.com/deanmalmgren/textract

Example:

    import textract
    text = textract.process("path/to/file.extension")

Hope this helps ✌️

4th Dec 2020, 7:02 AM

Piyush

Thanks for the suggestion, I will try it and let you know. For now, I am using a package called PDFMiner, which extracted the text well even for PDFs with a large number of pages and rich graphics.

9th Dec 2020, 6:05 PM

Gowtham rajasekher
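
For the search part of the question, here is a minimal sketch of one way to look up a list of terms once the text has been extracted. It assumes the pdfminer.six package (the maintained fork of PDFMiner) and its extract_text helper; the function names find_terms and search_pdf, the file name, and the search terms are hypothetical and only for illustration.

```python
import re


def find_terms(text, terms):
    """Return {term: occurrence count} for each term found in text, ignoring case."""
    # Collapse runs of whitespace so a term broken across lines can still match.
    normalized = re.sub(r"\s+", " ", text).lower()
    hits = {}
    for term in terms:
        needle = re.sub(r"\s+", " ", term).strip().lower()
        count = normalized.count(needle) if needle else 0
        if count:
            hits[term] = count
    return hits


def search_pdf(pdf_path, terms):
    """Extract the text of a PDF with pdfminer.six and search it for terms."""
    # Imported here so find_terms can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return find_terms(extract_text(pdf_path), terms)


# Hypothetical usage (file name and terms are placeholders):
# print(search_pdf("report.pdf", ["invoice number", "total amount"]))
```

This is plain substring matching, so "total" would also match inside "subtotal". If whole-word matches are needed, a regular expression with word boundaries may be a better fit.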