Creating a searchable (multipage) PDF with Python

Question

I've found some guides online on how to make a scanned PDF searchable. However, I'm currently struggling to figure out how to do it for a multipage PDF.

My code takes a multipage PDF, converts each page into a JPG, runs OCR on each page, and then converts the result into a PDF. However, only the last page is returned.

import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = 'directory'
TESSDATA_PREFIX = 'directory'
tessdata_dir_config = '--tessdata-dir directory'

# Path of the pdf
PDF_file = r"pdf directory"
  
  
def pdf_text():
    
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(PDF_file, 500)
  
    image_counter = 1

    for page in pages:

        # Declare file names
        filename = "page_"+str(image_counter)+".jpg"

        # Save the image of the page in system
        page.save(filename, 'JPEG')

        # Increment the counter to update filename
        image_counter = image_counter + 1

    # Variable to get count of total number of pages
    filelimit = image_counter-1

    outfile = "out_text.pdf"

    # Open the file in append mode so that all contents of all images are added to the same file
    
    f = open(outfile, "a")

    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):

        filename = "page_"+str(i)+".jpg"

        # Recognize the text as string in image using pytesseract
        result =  pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config) 

            
        f = open(outfile, "w+b")
        f.write(bytearray(result))
        f.close()

pdf_text()

How can I run this for all pages and output one merged PDF?

Answer 1

I can't run it, but I think the whole problem is that you use open(..., 'w+b') inside the loop. This removes the previous content on every iteration, so in the end only the last page is written.

You should use the file you already opened with open(outfile, "a") and close it after the loop. Because image_to_pdf_or_hocr returns bytes, the file has to be opened in binary mode ("ab").

# --- before loop ---

f = open(outfile, "ab")

# --- loop ---

for i in range(1, filelimit+1):

    filename = f"page_{i}.jpg"

    result =  pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config) 

    f.write(bytearray(result))

# --- after loop ---
        
f.close()
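
The difference between the two ways of opening the file can be reproduced without Tesseract. The following is a minimal sketch, not the asker's code: write_pages and demo are hypothetical helper names, and the byte strings are stand-ins for the OCR results.

```python
import os
import tempfile


def write_pages(path, pages, reopen_each_time):
    """Write each bytes chunk in `pages` to `path` and return the final file content."""
    if reopen_each_time:
        # Mirrors the question: "w+b" truncates the file on every open,
        # so each iteration discards what the previous one wrote.
        for chunk in pages:
            with open(path, "w+b") as f:
                f.write(chunk)
    else:
        # Open once before the loop and close once after it.
        with open(path, "wb") as f:
            for chunk in pages:
                f.write(chunk)
    with open(path, "rb") as f:
        return f.read()


def demo():
    pages = [b"page-1;", b"page-2;", b"page-3;"]  # stand-ins for OCR output
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.bin")
        truncated = write_pages(path, pages, reopen_each_time=True)
        complete = write_pages(path, pages, reopen_each_time=False)
    return truncated, complete


print(demo())  # (b'page-3;', b'page-1;page-2;page-3;')
```

Only the last chunk survives when the file is reopened with "w+b" inside the loop, which matches the symptom described in the question.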

BTW:

But there is another problem: image_to_pdf_or_hocr creates a complete PDF, with its own header and probably a trailer, so appending two results to each other does not produce a valid PDF. You would have to use a dedicated module to merge the PDFs. See, for example, "Merge PDF files".
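
A rough way to check whether an output file is really several PDFs glued together is to count the "%PDF-" headers and "%%EOF" end markers in its bytes. This is only a heuristic sketch: count_pdf_markers is a hypothetical helper, the two byte strings are stand-ins rather than valid PDFs, and a legitimately updated PDF may contain more than one "%%EOF", so the header count is the more telling number.

```python
def count_pdf_markers(data):
    """Return (number of '%PDF-' headers, number of '%%EOF' markers) in data."""
    return data.count(b"%PDF-"), data.count(b"%%EOF")


# Tiny stand-ins, NOT valid PDFs: only the header and end marker are realistic.
page_1 = b"%PDF-1.5\n...objects of page 1...\n%%EOF\n"
page_2 = b"%PDF-1.5\n...objects of page 2...\n%%EOF\n"

print(count_pdf_markers(page_1))           # (1, 1)
print(count_pdf_markers(page_1 + page_2))  # (2, 2)
```

To inspect a real file, the same function could be called on open(outfile, "rb").read().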

With PyPDF2, it would be something similar to:

    # --- before loop ---
    
    from PyPDF2 import PdfFileMerger
    import io

    merger = PdfFileMerger()

    # --- loop ---
    
    for i in range(1, filelimit + 1):

        filename = "page_"+str(i)+".jpg"

        result =  pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config)
        
        pdf_file_in_memory = io.BytesIO(result)        
        merger.append(pdf_file_in_memory)
        
    # --- after loop ---
    
    merger.write(outfile)
    merger.close()
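
To my knowledge, PdfFileMerger was removed in PyPDF2 3.0, and the project continues under the name pypdf, where PdfWriter offers append(). The sketch below shows the same merge under that assumption. merge_pdf_blobs is a hypothetical helper name, and its writer parameter is an illustrative injection point that is not part of any library.

```python
import io


def merge_pdf_blobs(pdf_blobs, out_path, writer=None):
    """Merge complete PDFs, each given as bytes, into one file at out_path.

    `writer` is a hypothetical injection point: any object with append(),
    write() and close(). When it is omitted, pypdf's PdfWriter is used.
    """
    if writer is None:
        from pypdf import PdfWriter  # assumption: pypdf is installed
        writer = PdfWriter()
    for blob in pdf_blobs:
        # Each blob is a full PDF document, so wrap it in a file-like object.
        writer.append(io.BytesIO(blob))
    writer.write(out_path)
    writer.close()
    return out_path
```

In the loop above, this would mean collecting each result in a list and calling merge_pdf_blobs(results, outfile) once after the loop.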

Answer 2

There are a number of potential issues here, and without being able to debug it, it's hard to say what the root cause is.

Are the JPGs being successfully created, and as separate files as expected?

I would suspect that pages = convert_from_path(PDF_file, 500) is not returning what you expect. Have you manually verified that the pages are being created as expected?
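
The manual check can also be scripted. This is a small sketch that assumes the page_N.jpg naming scheme from the question; missing_page_files is a hypothetical helper name.

```python
import os


def missing_page_files(directory, page_count):
    """Return the expected page_N.jpg names that are absent or empty in directory."""
    missing = []
    for i in range(1, page_count + 1):
        name = f"page_{i}.jpg"  # same naming scheme as the question's code
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

If missing_page_files(".", len(pages)) returns an empty list, the conversion step is probably fine and the problem is more likely in the loop that writes the output.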

2 solutions

Solution 1
2 Accepted 2021-08-16 12:54:01

Solution 2
0 2021-08-16 11:00:41
