extractText() function in pyPDF2 throws error

I am trying to extract text from PDFs so that I can analyze it, but when I try to extract the text from a page I receive the following error.

Traceback (most recent call last):
File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_comm.py", line 765, in doIt
    result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)

File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_vars.py", line 376, in evaluateExpression
    result = eval(compiled, updated_globals, frame.f_locals)

File "<string>", line 1, in <module>

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\generic.py", line 801, in getData
    decoded._data = filters.decodeStreamData(self)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 228, in decodeStreamData
    data = ASCII85Decode.decode(data)

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in decode
    data = [y for y in data if not (y in ' \n\r\t')]

File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in <listcomp>
    data = [y for y in data if not (y in ' \n\r\t')]

TypeError: 'in <string>' requires string as left operand, not int

The relevant code sections follow:

from PyPDF2 import PdfFileReader

for PDF_Entry in self.PDF_List:
    Pdf_File = PdfFileReader(open(PDF_Entry, "rb"))
    for pg_idx in range(0, Pdf_File.getNumPages()):
        page_Content = Pdf_File.getPage(pg_idx).extractText()
        for line in page_Content.split("\n"):
            self.Analyse_Line(line)

The error is thrown at the extractText() line.

It may be worth trying the latest version of PyPDF2; as I write this, the latest is 1.24.
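For what it's worth, the last frames of the traceback point at a Python 3 bytes-vs-str mismatch: iterating over a bytes object yields ints, and an int cannot be tested for membership in a str. Here is a minimal sketch that reproduces the same TypeError outside PyPDF2. The variable names are made up for illustration, and the bytes-based filter is only one possible way to avoid the error, not necessarily what any PyPDF2 release does.

```python
# Hypothetical stand-in for the stream data that reaches ASCII85Decode.decode().
raw = b"ab c\n"

# Same list comprehension as filters.py line 170 in the traceback.
# On Python 3 each y is an int, so `y in ' \n\r\t'` raises TypeError.
try:
    [y for y in raw if not (y in ' \n\r\t')]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# One way to strip whitespace from bytes on Python 3:
# compare against a bytes literal instead of a str literal.
cleaned = bytes(y for y in raw if y not in b" \n\r\t")
print(cleaned)
```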

With that said, I have found the extractText() feature to be very fragile. It works on some documents, fails on others. See some open issues:

https://github.com/mstamy2/PyPDF2/issues/180 and https://github.com/mstamy2/PyPDF2/issues/168

I worked around the problem by using the Poppler command-line utility pdftotext instead, both to classify a doc as image vs text and to get all the content. It has been extremely stable for me: I've run it on thousands of PDF documents. In my experience it also extracts text without further ado from protected/encrypted PDFs.

For example (written for Python 2):

from __future__ import print_function  # lets print(..., file=...) work on Python 2

import subprocess
import sys

def consult_pdftotext(filename):
    '''
    Runs pdftotext to extract text of pages 1..3.
    Returns the count of characters received.

    `filename`: Name of PDF file to be analyzed.
    '''
    print("Running pdftotext on file %s" % filename, file=sys.stderr)
    # The final hyphen tells pdftotext to write to stdout instead of a file.
    cmd_args = ["pdftotext", "-f", "1", "-l", "3", filename, "-"]
    pdf_pipe = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_out, std_err = pdf_pipe.communicate()
    # std_out is a byte string, so this counts the bytes of extracted text.
    count = len(std_out)
    return count
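To get the content itself and make the image-vs-text call, something along these lines might work. It is only a sketch: the helper names and the `min_chars` threshold are my own placeholders, it assumes pdftotext is on the PATH and emits UTF-8 (its default, as far as I know), and the "almost no text means scanned image" rule is a rough heuristic rather than a reliable test.

```python
import subprocess


def build_pdftotext_args(filename, first_page=None, last_page=None):
    """Build the pdftotext command line; the trailing "-" means stdout."""
    args = ["pdftotext"]
    if first_page is not None:
        args += ["-f", str(first_page)]
    if last_page is not None:
        args += ["-l", str(last_page)]
    args += [filename, "-"]
    return args


def extract_text(filename, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a str."""
    proc = subprocess.Popen(
        build_pdftotext_args(filename, first_page, last_page),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    std_out, std_err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("pdftotext failed: %r" % std_err)
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return std_out.decode("utf-8", errors="replace")


def looks_like_image_only(text, min_chars=20):
    """Heuristic: a PDF that yields almost no text is probably scanned images.

    min_chars is an arbitrary placeholder threshold; tune it on your own files.
    """
    # strip() also removes the form feeds pdftotext puts between pages.
    return len(text.strip()) < min_chars
```

Calling `extract_text("some.pdf", 1, 3)` and passing the result to `looks_like_image_only()` would then cover both uses, with "some.pdf" standing in for a real file.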

HTH

You are doing two things in one line. Try to break things down to get closer to the problem. Change:

page_Content = Pdf_File.getPage(pg_idx).extractText()

into

page = Pdf_File.getPage(pg_idx)
page_Content = page.extractText()

to see where the error happens. Also run the program from the command line rather than from Eclipse, just to make sure it is the same error. You say it happens at extractText(), but this line does not show up in the traceback.
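If you want to go one step further, you could record which of the two calls fails on which page. A rough sketch follows; `probe_pages` is a made-up helper name, and it only assumes the reader object offers the same getNumPages(), getPage() and extractText() methods used in the question.

```python
def probe_pages(reader):
    """Try each page and report (page index, failing step, error) tuples."""
    report = []
    for pg_idx in range(reader.getNumPages()):
        step = "getPage"
        try:
            page = reader.getPage(pg_idx)
            step = "extractText"
            page.extractText()
        except Exception as exc:
            # Keep going so that every failing page is listed, not just the first.
            report.append((pg_idx, step, repr(exc)))
    return report
```

With the question's code this would be called as `probe_pages(Pdf_File)`; an empty list means every page extracted without raising.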
