
PyPDF2 can't read non-English characters, returns empty string on extractText()

Original post 2020-02-24 13:06:57 8 1 python/ python-3.x/ pdf/ web-scraping/ pypdf2

I'm working on a script that extracts data from a large PDF file (40-60+ pages long). The file isn't in English; it contains Greek characters. Everything seems fine until I call PyPDF2's extractText() function to get a given page's contents, at which point it returns an empty string.

I'm new to this library and I don't know how to fix this problem.

1 answer

PyPDF2's extractText() looks like it will either work just fine or fail completely. There are no parameters you can pass in to try to get things to work properly. It'll work or it won't.
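To see how widespread the failure is before deciding what to do, a minimal sketch like the one below can list which pages come back empty. The helper names (`find_blank_pages`, `extract_page_texts`) are made up for illustration. The PyPDF2 calls are the legacy ones from the question (`PdfFileReader`, `getPage`, `extractText`); newer releases renamed them to `PdfReader`, `pages` and `extract_text()`, so adjust to whatever version you have installed.

```python
def find_blank_pages(page_texts):
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def extract_page_texts(path):
    """Extract text page by page using PyPDF2's legacy API (assumed PyPDF2 < 3.0)."""
    from PyPDF2 import PdfFileReader  # third-party, imported lazily

    with open(path, "rb") as handle:
        reader = PdfFileReader(handle)
        return [
            reader.getPage(index).extractText()
            for index in range(reader.getNumPages())
        ]


# Hypothetical usage; "report.pdf" is a placeholder file name:
# blank = find_blank_pages(extract_page_texts("report.pdf"))
# print("pages with no extractable text:", blank)
```

If every page is listed, the text layer is probably unusable for this library as a whole; if only a few are, those pages may simply be scans or artwork.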

You may not be able to fix this problem. If you can successfully copy/paste the text in Acrobat/Reader, then it's possible to extract the text. So what happens when you try to copy/paste out of Reader? Don't try this with some other third-party PDF viewer; use Adobe software. You'll probably have to abandon PyPDF2 and move on to some other PDF API, but if Reader can do it, it's a fixable problem.

There are three different things in a PDF that can look like letters to the human eye.

1. Letters in the PDF in some text encoding. There are several fixed encodings, plus PDF allows you to embed your own custom encodings (often used with font subsets). Software can create PDFs that look fine but can't really be copy/pasted from, even by Adobe.
2. Path art that just happens to look an awful lot like letters. "Start drawing a line here, draw a straight line to there, then a curve like this to there" and so on. If you're curious, PDF uses Bezier curves to define its curves. Not terribly related to your question, but interesting.
3. Bit maps (.jpeg/gif/etc images) that define a grid of pixels.
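If you can get at a page's decoded content stream (how you do that depends on your PDF library), a rough way to guess which of the three kinds you're looking at is to count the drawing operators. The sketch below is a heuristic only, not a real PDF parser: the function name `count_drawing_ops` is made up, nested parentheses inside strings and inline image data will confuse it, and `Do` can invoke a form XObject as well as an image.

```python
import re

# Operators from the PDF content-stream syntax.
TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}                           # show text (kind 1)
PATH_PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}  # stroke/fill paths (kind 2)
XOBJECT_OP = "Do"                                                # paint an XObject, often an image (kind 3)

# Crude matcher for (literal) and <hex> strings so their contents
# are not mistaken for operators. Does not handle nested parentheses.
_STRING = re.compile(rb"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>")


def count_drawing_ops(content: bytes) -> dict:
    """Count text-showing, path-painting and XObject operators in a content stream."""
    stripped = _STRING.sub(b" ", content)
    # Array brackets (as used by TJ) may touch neighbouring tokens; split them off.
    stripped = stripped.replace(b"[", b" ").replace(b"]", b" ")

    counts = {"text": 0, "path": 0, "xobject": 0}
    for token in stripped.split():
        op = token.decode("latin-1")
        if op in TEXT_SHOW_OPS:
            counts["text"] += 1
        elif op in PATH_PAINT_OPS:
            counts["path"] += 1
        elif op == XOBJECT_OP:
            counts["xobject"] += 1
    return counts


sample = b"BT /F1 12 Tf 72 700 Td (S f B) Tj ET 0 0 m 10 10 l S q /Im0 Do Q"
print(count_drawing_ops(sample))
```

A page with zero text operators but plenty of path or XObject operators has no text layer to extract at all. A page with text operators that still extracts as nothing points towards the encoding problems described in item 1.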

In the past, Reader has only been able to handle text type 1 above, and then only if the text was encoded properly. Broken custom encodings are alarmingly common (or were 7+ years ago when I stopped working on PDF software).

With broken type 1s, and all of 2 and 3, the only thing you can do is to run OCR on the PDF. OCR: Optical Character Recognition. There are several open source OCR projects out there, as well as commercial ones.
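In a script, that usually turns into "try the text layer first, OCR the page if nothing comes back". Here is a minimal sketch of that control flow. The function name and the two callables (`extract_text`, `ocr_page`) are hypothetical and passed in by the caller, so the sketch doesn't tie you to any particular PDF or OCR library.

```python
def text_with_ocr_fallback(num_pages, extract_text, ocr_page, min_chars=1):
    """Return a list of (source, text) tuples, one per page.

    extract_text(index) -> str or None   reads the PDF text layer
    ocr_page(index)     -> str or None   renders the page and runs OCR on it
    A page falls back to OCR when its text layer yields fewer than
    min_chars non-whitespace-trimmed characters.
    """
    results = []
    for index in range(num_pages):
        text = (extract_text(index) or "").strip()
        source = "text-layer"
        if len(text) < min_chars:
            # Nothing usable in the text layer: fall back to OCR.
            text = (ocr_page(index) or "").strip()
            source = "ocr"
        results.append((source, text))
    return results
```

One possible wiring (an assumption on my part, not something from the question) is to render pages with `pdf2image.convert_from_path` and pass each image to `pytesseract.image_to_string(image, lang="ell")` for Greek. Both are third-party packages and need Poppler and Tesseract with the Greek language data installed. Note that this only catches empty results; a broken custom encoding can also return non-empty garbage, which you'd have to detect some other way.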


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, contact yoyou2525@163.com.


solution1 1 ACCEPTED 2020-02-24 14:03:58
