Extracting information from a PDF file with pdfminer in Python

Question

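I ran into a problem while trying to extract some information from a PDF file with pdfminer in Spyder. Following the official pdfminer documentation, I first defined an extraction function: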

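# Imports were omitted in the original post; these assume the pdfminer.six module layout
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal

# Define a pdf-to-txt function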
def pdftotxt(path, new_name):
    # Create a pdf parser
    parser = PDFParser(path)
    # Create an object storing information
    document = PDFDocument(parser)
    # Evaluate if extractable
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to store shared resources
        resmag = PDFResourceManager()
        # Set a parameter for analysis
        laparams = LAParams()
        # Create a PDF object
        # device = PDFDevice(resmag)
        device = PDFPageAggregator(resmag,laparams=laparams)
        # Create a PDF interpreter
        interpreter = PDFPageInterpreter(resmag, device)
        # Analyzing each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Assign LTPage of this page
            layout = device.get_result()
            for y in layout:
                if(isinstance(y,LTTextBoxHorizontal)):
                    with open("%s"%(new_name),'a',encoding="utf-8") as f:
                        f.write(y.get_text()+"\n")  

# Open a PDF file to test
path = open('/keep_2.pdf')
pdftotxt(path, "pdfminer.txt")

But it returns an error message:

File "<ipython-input-2-11f054ad4321>", line 31, in <module>
    pdftotxt(path, "pdfminer.txt")

  File "<ipython-input-2-11f054ad4321>", line 5, in pdftotxt
    document = PDFDocument(parser)

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 557, in __init__
    pos = self.find_xref(parser)

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 759, in find_xref
    for line in parser.revreadlines():

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/psparser.py", line 268, in revreadlines
    n = max(s.rfind(b'\r'), s.rfind(b'\n'))

TypeError: must be str, not bytes

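Can anyone help me resolve this error? I tried googling it, but there do not seem to be any reports of similar problems with pdfminer. Thanks a lot in advance for your help.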

Answer 1

Posting my comment as an answer, so that this doesn't look like an open question to people scrolling by:

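Instead of open('/keep_2.pdf'), use open('/keep_2.pdf', 'rb') to open the file in binary mode. The traceback shows pdfminer searching the file contents for byte strings such as b'\r', which fails when a text-mode file hands it str instead of bytes.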
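
To see why the mode matters without needing pdfminer installed, here is a minimal sketch that reuses the expression from the last frame of the traceback (`max(s.rfind(b'\r'), s.rfind(b'\n'))`) on data read in each mode. The helper names `last_line_break` and `try_mode`, and the small temporary stand-in file, are made up for this illustration and are not part of pdfminer.

```python
import os
import tempfile


def last_line_break(chunk):
    """Same expression as the failing line in the traceback (psparser.py)."""
    return max(chunk.rfind(b"\r"), chunk.rfind(b"\n"))


def try_mode(path, mode):
    """Read the file in the given mode and report what the byte search does."""
    with open(path, mode) as fp:
        chunk = fp.read()
    try:
        return type(chunk).__name__, last_line_break(chunk)
    except TypeError as exc:
        # Text mode yields str, and str.rfind() rejects a bytes argument
        return type(chunk).__name__, "TypeError: %s" % exc


# Hypothetical stand-in for '/keep_2.pdf': a few ASCII bytes, not a real PDF
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\nstartxref\n0\n%%EOF")
    demo_path = tmp.name

text_result = try_mode(demo_path, "r")     # like open('/keep_2.pdf')
binary_result = try_mode(demo_path, "rb")  # like open('/keep_2.pdf', 'rb')
os.remove(demo_path)

print(text_result)
print(binary_result)
```

In the text-mode case the search raises the same kind of TypeError as in the question, while in binary mode it returns an index. For the original script, the equivalent change would be `path = open('/keep_2.pdf', 'rb')`, ideally inside a `with` block so the file gets closed afterwards.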

Answer 1: accepted, 2020-06-07 08:08:29
