How do I extract all of the text from a PDF using indexing

I am new to Python and coding in general. I'm trying to create a program that will OCR a directory of PDFs and then extract the text so I can later pick out specific things. However, I am having trouble getting pdfplumber to extract all the text from all of the pages. You can index from a start to an end, but if the end is unknown, it breaks because the index is out of range.

import ocrmypdf
import os
import requests
import pdfplumber
import re
import logging
import sys
import PyPDF2

## test folder C:\Users\adams\OneDrive\Desktop\PDF

user_direc = input("Enter the path of your files: ") 

#walks the path and prints out each PDF in the 
#OCRs the documents and skips any OCR'd pages.


for dir_name, subdirs, file_list in os.walk(user_direc):
    logging.info(dir_name + '\n')
    os.chdir(dir_name)
    for filename in file_list:
        file_ext = os.path.splitext(filename)[1]  # [1] is the extension, e.g. '.pdf'
        if file_ext == '.pdf':
            full_path = dir_name + '/' + filename
            print(full_path)
            result = ocrmypdf.ocr(filename, filename, skip_text=True, deskew=True, optimize=1)
            logging.info(result)

#the next step is to extract the text from each individual document and print

directory = os.fsencode(user_direc)
    
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)  

As is, this will only take the text from the first page of each PDF. I want to extract all of the text from each PDF, but pdfplumber will break if my index is too large, and I do not know the number of pages the PDF will have. I've tried

page = pdf.pages[0--1]

but this breaks as well. I have not been able to find a workaround with PyPDF2, either. I apologize if this code is sloppy or unreadable. I've tried to add comments to explain what I am doing.

The pdfplumber git page says pdfplumber.open returns an instance of the pdfplumber.PDF class.

That instance has the pages property, which is a list of pdfplumber.Page instances, one per page loaded from your PDF. Looking at your code, if you do:

total_pages = len(pdf.pages)

You should get the total number of pages for the currently loaded PDF.
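A note on the pdf.pages[0--1] attempt from the question, as a small sketch that uses a plain Python list as a stand-in for pdf.pages (the list contents are made up for illustration). Python reads 0--1 as the arithmetic 0 - (-1), which is 1, so the expression asks for the second page and fails on a single-page PDF. Slices, by contrast, do not raise when a bound is past the end.

```python
# A plain list standing in for pdf.pages (hypothetical one-page document).
pages = ["page one"]

# 0--1 is arithmetic: 0 - (-1) == 1, so pages[0--1] means pages[1].
index = 0--1

try:
    pages[index]
    index_error_raised = False
except IndexError:
    index_error_raised = True

# Slices are clamped to the list length instead of raising.
all_pages = pages[0:]        # every page, whatever the count
first_ten = pages[0:10]      # at most ten pages, fewer if the PDF is shorter
total_pages = len(pages)     # the last valid index is total_pages - 1
```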

To combine all of the PDF's text into one giant text string, you could try the 'for in' operation. Try changing your existing code:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)  

To:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        all_text = '' # new line
        with pdfplumber.open(file) as pdf:
            # page = pdf.pages[0] - comment out or remove line
            # text = page.extract_text() - comment out or remove line
            for pdf_page in pdf.pages:
                # extract_text() may return None for a page with no text
                single_page_text = pdf_page.extract_text() or ''
                print( single_page_text )
                # separate each page's text with newline
                all_text = all_text + '\n' + single_page_text
            print(all_text)
            # print(text) - comment out or remove line  

Rather than use an index such as pdf.pages[0] to access individual pages, use for pdf_page in pdf.pages. The loop stops after it reaches the last page without raising an exception, so you won't have to worry about using an index value that's out of range.
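The same idea can be wrapped in a small helper. This is only a sketch: join_page_text and extract_pdf_text are hypothetical names, not part of pdfplumber, and the helper assumes only that each page object has an extract_text() method. It treats a None result as an empty page so the concatenation cannot fail on a page without text.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page-like object in `pages`."""
    parts = []
    for page in pages:
        # Treat a missing result (None) as an empty page.
        parts.append(page.extract_text() or "")
    return separator.join(parts)


def extract_pdf_text(path):
    """Open `path` with pdfplumber and return the text of all its pages."""
    import pdfplumber  # imported here so join_page_text works without it

    with pdfplumber.open(path) as pdf:
        return join_page_text(pdf.pages)
```

Collecting the pieces in a list and joining once also avoids the leading newline that all_text = all_text + '\n' + ... puts at the start of the string.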

If you encounter this error when you try the code above:

fp = open(path_or_fp, "rb") FileNotFoundError: [Errno 2] No such file or directory:

this is because os.listdir() gives only the filename, and you have to join it with the directory. The os.listdir() function returns names relative to the directory you're listing, so you need to reconstruct the full path to open those files.

To resolve this error, try the code below:

import os
import pdfplumber

directory = r'C:\Users\foo\folder'

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        fullpath = os.path.join(directory, filename)
        #print(fullpath)
        all_text = ""
        with pdfplumber.open(fullpath) as pdf:
            for page in pdf.pages:
                # extract_text() may return None for a page with no text
                text = page.extract_text() or ""
                #print(text)
                all_text += '\n' + text
        print(all_text)
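The question walks subfolders with os.walk, while os.listdir only sees the top-level folder. As a sketch of one way to collect full paths from the whole tree, the helper below uses pathlib; find_pdfs is a hypothetical name, and the case-insensitive extension check is an assumption about what you want to match.

```python
from pathlib import Path


def find_pdfs(root):
    """Return full paths of all .pdf files under `root`, including subfolders."""
    root = Path(root)
    # rglob walks the tree; the suffix check is case-insensitive (.PDF matches).
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path already includes the directory, so str(path) can be handed to pdfplumber.open without a separate os.path.join step.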

Reference: Extract text from pdf file using pdfplumber


 