Extracting text from each PDF page using pdfminer.six

Question

The documentation for pdfminer is poor at best. I was initially using pdfminer and had it working for some PDF files, then I ran into some bugs and realized I should be using pdfminer.six.

I want to extract the text from each page of the PDF so that I can keep track of which page specific words appear on.

Using the documentation:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization (an empty string for an unencrypted PDF).
password = ''
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)

We've parsed all the pages, but there is no documentation on how to get the elements, or anything else, out of a PDFPage.

I looked through the PDFPage.py file for a way to extract the text from each PDF page, and of course it's not that simple.

To complicate matters, there are at least 3 versions of pdfminer, and things have changed over time, so none of the examples I can find are compatible.

Answer 1

Here is the version I'm using for extracting text from PDF files.

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(pdf_path):
    """
    Extract the text of a PDF file and return it as a single string.
    :param pdf_path: path to the PDF file.
    :return: string containing the text of all pages, or None if no text was found.
    """
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text
    return None
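
The function above returns the text of the whole document in one string. Since the question asks for the text of each page separately, here is a minimal sketch of a per-page variant, using the same pdfminer.six classes as above. The names `extract_text_by_page` and `find_word_pages` are my own, not part of pdfminer.six. Passing `laparams=LAParams()` is an assumption on my part: as far as I know, it turns on layout analysis so that spaces and line breaks are kept. Treat this as a starting point, not a definitive implementation.

```python
import io
import re


def extract_text_by_page(pdf_path):
    """Return a list with one text string per page of the PDF."""
    # pdfminer.six is imported here so that the pure-Python helper below
    # can also be used where the library is not installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    pages = []
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # A fresh buffer and converter per page keeps each page's text separate.
            buffer = io.StringIO()
            converter = TextConverter(resource_manager, buffer, laparams=LAParams())
            PDFPageInterpreter(resource_manager, converter).process_page(page)
            pages.append(buffer.getvalue())
            converter.close()
            buffer.close()
    return pages


def find_word_pages(pages, word):
    """Return the 1-based numbers of the pages containing `word`.

    The match is case-insensitive and on whole words only.
    """
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return [number for number, text in enumerate(pages, start=1)
            if pattern.search(text)]
```

Usage would look something like `pages = extract_text_by_page('mypdf.pdf')` followed by `find_word_pages(pages, 'invoice')`, which gives the page numbers where the word occurs. The second function only needs a list of strings, so it works with any other way of getting per-page text as well.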


Answer 1: 0, 2019-07-13 19:51:31
