Using pdfminer in Python to extract information from a PDF file
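I ran into a problem while trying to use pdfminer in Spyder to extract some information from a PDF file. Following the official pdfminer documentation, I first tried to define an extraction function: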
# Define a pdf-to-txt function
def pdftotxt(path, new_name):
    # Create a pdf parser
    parser = PDFParser(path)
    # Create an object storing information
    document = PDFDocument(parser)
    # Evaluate if extractable
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource management to restore resource
        resmag = PDFResourceManager()
        # Set a parameter for analysis
        laparams = LAParams()
        # Create a PDF object
        # device = PDFDevice(resmag)
        device = PDFPageAggregator(resmag,laparams=laparams)
        # Create a PDF interpreter
        interpreter = PDFPageInterpreter(resmag, device)
        # Analyzing each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Assign LTPage of this page
            layout = device.get_result()
            for y in layout:
                if(isinstance(y,LTTextBoxHorizontal)):
                    with open("%s"%(new_name),'a',encoding="utf-8") as f:
                        f.write(y.get_text()+"\n")
# Get a PDF's directory to test
path = open('/keep_2.pdf')
pdftotxt(path, "pdfminer.txt")
But it returns an error message:
  File "<ipython-input-2-11f054ad4321>", line 31, in <module>
    pdftotxt(path, "pdfminer.txt")
  File "<ipython-input-2-11f054ad4321>", line 5, in pdftotxt
    document = PDFDocument(parser)
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 557, in __init__
    pos = self.find_xref(parser)
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 759, in find_xref
    for line in parser.revreadlines():
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/psparser.py", line 268, in revreadlines
    n = max(s.rfind(b'\r'), s.rfind(b'\n'))
TypeError: must be str, not bytes
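Can anyone help me resolve this error? I tried googling it, but there don't seem to be any reports of a similar problem with pdfminer. Thank you very much in advance for your help.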
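Posting my comment as an answer, so that this doesn't look like an open question to people scrolling by:
Instead of open('/keep_2.pdf'), use open('/keep_2.pdf', 'rb') to open the file in binary mode. As the last frame of the traceback shows, pdfminer searches the data it reads for bytes patterns such as b'\r' and b'\n'. A file opened in text mode returns str rather than bytes, which is what triggers the TypeError.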
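To see why the file mode matters, the mismatch can be reproduced with the standard library alone, without pdfminer. The snippet below is only a sketch: the temporary file is a stand-in for a real PDF (its contents are not a valid document), and the names demo_path, text_chunk and binary_chunk are made up for illustration. It runs the same kind of call that appears in the last frame of the traceback, rfind with a bytes pattern, on data read in each mode.

```python
import os
import tempfile

# Write a few bytes to a temporary file. This is a hypothetical stand-in
# for a real PDF, not a valid document.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\nstartxref\n0\n%%EOF\n")
    demo_path = tmp.name

# Text mode (the default): read() returns str, so searching it for a
# bytes pattern raises TypeError, as in the traceback above.
with open(demo_path, encoding="latin-1") as text_file:
    text_chunk = text_file.read()
try:
    text_chunk.rfind(b"\n")
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Binary mode: read() returns bytes, so the same search succeeds.
with open(demo_path, "rb") as binary_file:
    binary_chunk = binary_file.read()
last_newline = binary_chunk.rfind(b"\n")

os.remove(demo_path)
print(type(text_chunk).__name__, type(binary_chunk).__name__)
print("text mode raised:", type(text_mode_error).__name__)
```

In the question's code, the corresponding change is the single line path = open('/keep_2.pdf', 'rb'). Opening the file in a with block around the call to pdftotxt would also make sure it gets closed afterwards.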