Extracting text from each PDF page using pdfminer.six

The documentation for pdfminer is poor at best. I was initially using pdfminer and had it working for some PDF files, but then I ran into some bugs and realized I should be using pdfminer.six.

I want to extract the text from each page of the PDF so that I can keep track of which page specific words were found on.

Using the documentation:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization (an empty string if the PDF has none).
password = ''
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)

We've parsed all the pages, but there is no documentation on how to get the elements, or anything else, out of a PDFPage.

I looked through the PDFPage.py file for a way to extract the text from each PDF page, and of course it's not that simple.

To complicate matters, there are at least 3 versions of pdfminer, and the API has changed over time, so none of the examples I can find are compatible.

Here is the version I'm using to extract text from PDF files.

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(pdf_path):
    """
    Extract the text of a PDF file and return it as a single string.
    :param pdf_path: path to the PDF file.
    :return: the text of the whole PDF, or None if no text was found.
    """
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text
    return None
