extractText() function in PyPDF2 throws error
I am trying to extract text from a PDF so that I can analyse it, but when I try to extract the text from a page I get the following error.
Traceback (most recent call last):
  File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_comm.py", line 765, in doIt
    result = pydevd_vars.evaluateExpression(self.thread_id, self.frame_id, self.expression, self.doExec)
  File "C:\Program Files (x86)\eclipse\plugins\org.python.pydev_2.7.4.2013051601\pysrc\pydevd_vars.py", line 376, in evaluateExpression
    result = eval(compiled, updated_globals, frame.f_locals)
  File "<string>", line 1, in <module>
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1701, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\pdf.py", line 1783, in __init__
    stream = StringIO(stream.getData())
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\generic.py", line 801, in getData
    decoded._data = filters.decodeStreamData(self)
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 228, in decodeStreamData
    data = ASCII85Decode.decode(data)
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in decode
    data = [y for y in data if not (y in ' \n\r\t')]
  File "C:\Python33\lib\site-packages\pypdf2-1.9.0-py3.3.egg\PyPDF2\filters.py", line 170, in <listcomp>
    data = [y for y in data if not (y in ' \n\r\t')]
TypeError: 'in <string>' requires string as left operand, not int
The relevant section of code is as follows:
from PyPDF2 import PdfFileReader
for PDF_Entry in self.PDF_List:
    Pdf_File = PdfFileReader(open(PDF_Entry, "rb"))
    for pg_idx in range(0, Pdf_File.getNumPages()):
        page_Content = Pdf_File.getPage(pg_idx).extractText()
        for line in page_Content.split("\n"):
            self.Analyse_Line(line)
The error is thrown on the extractText() line.
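As a reading of the traceback rather than a confirmed diagnosis: the last frame applies a `str` membership test to each item of `data`, and on Python 3 iterating over a `bytes` object yields integers, which would produce exactly this `TypeError`. The sketch below reproduces that behaviour without PyPDF2. The helper names and the sample bytes are hypothetical, and `strip_whitespace` is only an illustration of the idea, not the library's actual fix.

```python
def strip_whitespace_str_only(data):
    # Same expression as the failing line in the traceback: it assumes
    # that iterating over `data` yields one-character strings.
    return [y for y in data if not (y in ' \n\r\t')]


def strip_whitespace(data):
    # Hypothetical variant: decode bytes to str first so that iteration
    # yields characters rather than integers.
    if isinstance(data, bytes):
        data = data.decode('ascii')
    return [y for y in data if y not in ' \n\r\t']


sample = b"87cURD]i,\n"  # made-up bytes standing in for an ASCII85 stream

try:
    strip_whitespace_str_only(sample)
except TypeError as exc:
    # On Python 3 the first item is an int, so the membership test fails.
    print("reproduced:", exc)

print("".join(strip_whitespace(sample)))
```

If this reading is right, the problem lies in how that PyPDF2 release handles bytes on Python 3 rather than in the calling code.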
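It may be worth trying the latest version of PyPDF2, which is 1.24 as I write this.
That said, I have found the extractText() functionality to be very fragile. It works on some documents and fails on others. See some of the open issues:
https://github.com/mstamy2/PyPDF2/issues/180 and https://github.com/mstamy2/PyPDF2/issues/168
I worked around this by using the Poppler command-line utility pdftotext, both to classify documents as image vs. text and to get all of the content. It has been very stable for me: I have run it on many thousands of PDF documents. In my experience it also extracts text from protected/encrypted PDFs without complaint.
For example (written for Python 2):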
from __future__ import print_function  # lets Python 2 accept print(..., file=...)
import subprocess
import sys

def consult_pdftotext(filename):
    '''
    Runs pdftotext to extract text of pages 1..3.
    Returns the count of characters received.
    `filename`: Name of PDF file to be analyzed.
    '''
    print("Running pdftotext on file %s" % filename, file=sys.stderr)
    # don't forget that final hyphen to say, write to stdout!!
    cmd_args = ["pdftotext", "-f", "1", "-l", "3", filename, "-"]
    pdf_pipe = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_out, std_err = pdf_pipe.communicate()
    count = len(std_out)
    return count
HTH
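For readers on Python 3, the same approach might look roughly like the sketch below. It assumes `pdftotext` from Poppler is on the PATH and emits UTF-8, which is its usual default. The function names, the `min_chars` threshold of 50, and the idea of classifying by character count are illustrative assumptions, not something the answer specifies.

```python
import subprocess


def pdftotext_extract(filename, first_page=1, last_page=3):
    """Return the text of the given page range, as produced by pdftotext."""
    # The trailing "-" tells pdftotext to write to stdout instead of a file.
    cmd_args = ["pdftotext", "-f", str(first_page), "-l", str(last_page),
                filename, "-"]
    result = subprocess.run(cmd_args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    return result.stdout.decode("utf-8", errors="replace")


def classify_pdf_text(text, min_chars=50):
    """Guess whether a PDF is text-based or scanned images.

    The threshold is an arbitrary assumption: a document that yields almost
    no characters is treated as image-only.
    """
    return "text" if len(text.strip()) >= min_chars else "image"
```

A call such as `classify_pdf_text(pdftotext_extract("example.pdf"))` would then return `"text"` or `"image"`; `example.pdf` is a placeholder name, and the threshold would need tuning against real documents.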
You are doing two things on one line. Try splitting them up to get closer to the problem. Change:
page_Content = Pdf_File.getPage(pg_idx).extractText()
to
page = Pdf_File.getPage(pg_idx)
page_Content = page.extractText()
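Then see where the error occurs. Also try running the program from the command line rather than from Eclipse, to make sure it is the same error. You say it happens on the extractText() line, but that line does not appear in the traceback.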
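One way to apply this advice is to run each step separately and print the full traceback when a page fails, as in the sketch below. It is meant to be run as a plain script from the command line. `extract_pages_verbose` is a hypothetical helper, and `reader` is assumed to be an object exposing the `getNumPages()` and `getPage()` methods used in the question.

```python
import sys
import traceback


def extract_pages_verbose(reader):
    """Return a list of (page_index, text_or_None), reporting which step failed."""
    results = []
    for pg_idx in range(reader.getNumPages()):
        step = "getPage"
        try:
            page = reader.getPage(pg_idx)   # step 1: fetch the page object
            step = "extractText"
            text = page.extractText()       # step 2: decode and extract text
        except Exception:
            print("page %d failed during %s" % (pg_idx, step), file=sys.stderr)
            traceback.print_exc()
            text = None
        results.append((pg_idx, text))
    return results


# Usage, assuming the PyPDF2 1.x API from the question and a placeholder file:
# from PyPDF2 import PdfFileReader
# with open("example.pdf", "rb") as fh:
#     extract_pages_verbose(PdfFileReader(fh))
```

Catching the exception per page also lets the remaining pages be processed, which helps show whether the failure is tied to one page or to the whole file.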