How to convert whole pdf to text in python

I have to convert a whole PDF to text. I have seen many examples that convert a PDF to text, but only for a particular page.

    import os
    from PyPDF2 import PdfFileReader

    def text_extractor(pdf_path):
        with open(pdf_path, 'rb') as f:
            pdf = PdfFileReader(f)
            # Here I can specify a page, but I need to convert the whole PDF without specifying pages
            page = pdf.getPage(0)
            text = page.extractText()
            print(text)

    if __name__ == '__main__':
        path = "C:\\Users\\AAAA\\Desktop\\BB"
        for file in os.listdir(path):
            if not file.endswith(".pdf"):
                continue
            # Pass the full path instead of relying on the global loop variable
            text_extractor(os.path.join(path, file))

How can I convert the whole PDF file to text without using getPage()?

If all you want is the text of the full document, you may want to use textract, as this answer recommends.

If you want to use PyPDF2, you can first get the number of pages and then iterate over each page, such as:

    import os
    from PyPDF2 import PdfFileReader

    def text_extractor(pdf_path):
        with open(pdf_path, 'rb') as f:
            pdf = PdfFileReader(f)
            # Walk through every page and accumulate its text
            text = ""
            for page_num in range(pdf.getNumPages()):
                page = pdf.getPage(page_num)
                text += page.extractText()
            print(text)

    if __name__ == '__main__':
        path = "C:\\Users\\AAAA\\Desktop\\BB"
        for file in os.listdir(path):
            if not file.endswith(".pdf"):
                continue
            # Pass the full path instead of relying on the global loop variable
            text_extractor(os.path.join(path, file))
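
Note that `PdfFileReader`, `getNumPages()`, `getPage()` and `extractText()` were removed in PyPDF2 3.0; the newer names are `PdfReader`, `reader.pages` and `extract_text()`. Below is a minimal sketch of the same loop with the newer API, assuming PyPDF2 3.x is installed. The function names `join_page_text` and `pdf_to_text` are made up for illustration.

```python
def join_page_text(pages):
    """Concatenate the text of every page, separated by a newline."""
    # Treat a missing result as an empty string so the join never fails
    return "\n".join(page.extract_text() or "" for page in pages)


def pdf_to_text(pdf_path):
    """Return the text of all pages in the PDF at pdf_path."""
    # Imported here so join_page_text can be tried without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)  # accepts a file path or an open binary file
    return join_page_text(reader.pages)
```

Calling `pdf_to_text(os.path.join(path, file))` inside the directory loop would then replace `text_extractor`.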

You may, however, want to remember which page the text came from, in which case you could use a list:

page_text = []
for page_num in range(pdf.getNumPages()): # For each page
    page = pdf.getPage(page_num) # Get that page's reference
    page_text.append(page.extractText()) # Add that page to our array
for page in page_text:
    print(page) # print each page
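
Keeping one entry per page makes it straightforward to report where something was found. Here is a small sketch, assuming `page_text` is a list of strings like the one built above; `pages_containing` is a hypothetical helper, not part of PyPDF2.

```python
def pages_containing(page_text, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        number
        for number, text in enumerate(page_text, start=1)
        if needle in text.lower()
    ]


# Stand-in data; in practice page_text comes from the extraction loop
sample_pages = ["Introduction", "Methods and results", "Results discussion"]
print(pages_containing(sample_pages, "results"))
```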

You could use tika to accomplish this task, but the output needs a little cleaning.

from tika import parser

parse_entire_pdf = parser.from_file('mypdf.pdf', xmlContent=True)
parse_entire_pdf = parse_entire_pdf['content']
print (parse_entire_pdf)
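
What "a little cleaning" involves depends on the document. With `xmlContent=True` the content is returned as XHTML markup, so one possible cleanup is to strip the tags and drop empty lines. The sketch below uses only the standard library; `clean_xml_content` is a hypothetical helper and the sample string is made up, so real Tika output will look different.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text nodes of an (X)HTML string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_xml_content(xhtml):
    """Strip tags, collapse runs of whitespace per line and drop empty lines."""
    parser = _TextOnly()
    parser.feed(xhtml)
    parser.close()
    text = "".join(parser.chunks)
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample standing in for parse_entire_pdf['content']
sample = "<html><body><div><p>First   line</p>\n\n<p>Second line </p></div></body></html>"
print(clean_xml_content(sample))
```

Omitting `xmlContent=True` may be simpler if the markup is not needed, since the content should then come back as plain text.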

This answer uses PyPDF2 and encode('utf-8') to keep the output per page together.

from PyPDF2 import PdfFileReader

def pdf_text_extractor(path):
  with open(path, 'rb') as f:
    pdf = PdfFileReader(f)

    # Get total pdf page number.
    totalPageNumber = pdf.numPages

    currentPageNumber = 0

    # Stay inside the with block so the file is still open while pages are read.
    while (currentPageNumber < totalPageNumber):
      page = pdf.getPage(currentPageNumber)

      text = page.extractText()
      # Printing the encoded bytes shows each page on a single line
      # (newlines appear as \n); type is <class 'bytes'>
      print(text.encode('utf-8'))

      #################################
      # This outputs the text to a list,
      # but it doesn't keep paragraphs
      # together
      #################################
      # output = text.encode('utf-8')
      # split = str(output, 'utf-8').split('\n')
      # print (split)
      #################################

      # Process next page.
      currentPageNumber += 1

path = 'mypdf.pdf'
pdf_text_extractor(path)
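
If the goal is simply one line per page, collapsing whitespace gives a similar result while keeping the value a `str` rather than `bytes`. This is a sketch of an alternative, not what the answer above does; `one_line` is a hypothetical helper.

```python
def one_line(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return " ".join(text.split())


# Stand-in for the string returned by page.extractText()
page_text = "First paragraph\ncontinues here.\n\nSecond   paragraph.\n"
print(one_line(page_text))
```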

Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":

from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""

try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass

PDF is a page-oriented format, and therefore you'll need to deal with the concept of pages.

What makes it perhaps even more difficult: you're not guaranteed that the text you extract comes out in the same order as it is presented on the page. PDF allows one to say "put this text within a 4x3 box situated 1" from the top, with a 1" left margin", and then put the next set of text somewhere else on the same page.

Your extractText() function simply returns the extracted text blocks in document order, not presentation order.
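
To illustrate the difference, the sketch below sorts positioned text fragments into reading order. The `(x, y, text)` tuples and the `reading_order` helper are hypothetical: extractText() does not expose coordinates, so a real implementation would need a library that reports positions. PDF coordinates are assumed to start at the bottom-left of the page, so a larger y means higher up.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Sort (x, y, text) fragments top-to-bottom, then left-to-right.

    A fragment joins the current line when its y value is within
    line_tolerance of the first fragment on that line.
    """
    # Highest fragments first (PDF y grows upwards)
    by_height = sorted(fragments, key=lambda frag: -frag[1])

    lines = []  # each entry is the list of fragments on one visual line
    for frag in by_height:
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within a line, order by x and join the pieces with a space
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda frag: frag[0]))
        for line in lines
    ]


# Document order here differs from what a reader sees on the page
fragments = [
    (300, 700, "world"),
    (72, 650, "Second line"),
    (72, 700.5, "Hello"),
]
print(reading_order(fragments))
```

Multi-column layouts and tables would need more than this simple top-to-bottom rule.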

Tables are notoriously difficult to extract in a common, meaningful way. You see them as tables; PDF sees them as text blocks placed on the page with little or no relationship to each other.

Still, getPage() and extractText() are good starting points, and if you have simply formatted pages, they may work fine.

I found a very simple way to do this.

You have to follow these steps:

  1. Install PyPDF2: If you use Anaconda, search for Anaconda Prompt and type the following command (you need administrator permission to do this).

    pip install PyPDF2

If you're not using Anaconda, you have to install pip and make sure it is on the path used by your cmd or terminal.

  2. Python Code: The following code shows how to convert a pdf file very easily:

     import PyPDF2

     with open("pdf file path here", 'rb') as file_obj:
         pdf_reader = PyPDF2.PdfFileReader(file_obj)
         # getPage(0) reads the first page only
         raw = pdf_reader.getPage(0).extractText()
         print(raw)

I just used the pdftotext module to get this done easily.

import pdftotext

# Load your PDF
with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# creating a text file after iterating through all pages in the pdf
file = open("test.txt", "w")
for page in pdf:
    file.write(page)
file.close()
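
The same idea can be written with a `with` block so the output file is closed even if writing fails. This is a sketch under assumptions: `save_pages` and `pdf_to_txt` are hypothetical names, and the blank line between pages is a choice made here, not something the original code does.

```python
def save_pages(pages, out_path, separator="\n\n"):
    """Write every page's text to out_path and return the number of pages."""
    page_list = list(pages)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(separator.join(page_list))
    return len(page_list)


def pdf_to_txt(pdf_path, txt_path):
    """Convert all pages of pdf_path to a single text file."""
    # Imported lazily so save_pages can be tried without pdftotext installed
    import pdftotext

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    return save_pages(pdf, txt_path)
```

With the file names from the answer, the call would be `pdf_to_txt("test.pdf", "test.txt")`.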

Link: https://github.com/manojitballav/pdf-text
