
Using pdfminer (Python) to extract information from a PDF file

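I ran into a problem while trying to use pdfminer in Spyder to extract some information from a PDF file. Following the official pdfminer documentation, I first defined an extraction function: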

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser

# Define a pdf-to-txt function
def pdftotxt(path, new_name):
    # Create a PDF parser from the open file object
    parser = PDFParser(path)
    # Create a document object that stores the PDF structure
    document = PDFDocument(parser)
    # Check whether text extraction is allowed
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a resource manager to store shared resources
        resmag = PDFResourceManager()
        # Set the layout analysis parameters
        laparams = LAParams()
        # Create a device that aggregates each page into a layout object
        # device = PDFDevice(resmag)
        device = PDFPageAggregator(resmag, laparams=laparams)
        # Create a PDF interpreter
        interpreter = PDFPageInterpreter(resmag, device)
        # Analyze each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Get the LTPage layout of this page
            layout = device.get_result()
            for y in layout:
                if isinstance(y, LTTextBoxHorizontal):
                    with open(new_name, 'a', encoding="utf-8") as f:
                        f.write(y.get_text() + "\n")

# Open a PDF file to test
path = open('/keep_2.pdf')
pdftotxt(path, "pdfminer.txt")

But it returns an error message:

File "<ipython-input-2-11f054ad4321>", line 31, in <module>
    pdftotxt(path, "pdfminer.txt")

  File "<ipython-input-2-11f054ad4321>", line 5, in pdftotxt
    document = PDFDocument(parser)

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 557, in __init__
    pos = self.find_xref(parser)

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 759, in find_xref
    for line in parser.revreadlines():

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/psparser.py", line 268, in revreadlines
    n = max(s.rfind(b'\r'), s.rfind(b'\n'))

TypeError: must be str, not bytes

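Can anyone help me resolve this error? I tried searching for it on Google, but no similar problem seems to have been reported for pdfminer. Thank you very much in advance for your help.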

Posting my comment as an answer so that this doesn't look like an unanswered question to people scrolling by:

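Instead of open('/keep_2.pdf'), use open('/keep_2.pdf', 'rb') to open the file in binary mode. In text mode the file object returns str, while the last frame of the traceback shows pdfminer searching the data for bytes (s.rfind(b'\r')), which is what raises the TypeError.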
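The mismatch can be reproduced without pdfminer. The following is a minimal sketch using only the standard library: the temporary file and its contents are a hypothetical stand-in for a real PDF, and last_line_break is an illustrative helper that repeats the expression from the failing line of the traceback; it is not part of pdfminer's API.

```python
import os
import tempfile

# Hypothetical stand-in for a PDF: a few bytes that end with a newline.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\nstartxref\n0\n%%EOF\n")
    sample_path = tmp.name


def last_line_break(chunk):
    """Illustrative helper: same expression as the failing line in the traceback."""
    return max(chunk.rfind(b"\r"), chunk.rfind(b"\n"))


# Text mode: read() returns str, so searching it for bytes raises TypeError.
with open(sample_path, encoding="latin-1") as fp:
    text_chunk = fp.read()
try:
    last_line_break(text_chunk)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Binary mode: read() returns bytes, and the same search succeeds.
with open(sample_path, "rb") as fp:
    binary_chunk = fp.read()
binary_mode_index = last_line_break(binary_chunk)

os.remove(sample_path)

print(type(text_chunk).__name__, "->", repr(text_mode_error))
print(type(binary_chunk).__name__, "->", binary_mode_index)
```

Applied to the code in the question, only the call site should need to change, for example `with open('/keep_2.pdf', 'rb') as fp: pdftotxt(fp, "pdfminer.txt")`; the `with` block also closes the file once extraction finishes.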

