Converting PDF to text with Python pytesseract

Question

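I am trying to convert many PDF files to txt. My PDF files are organized in subdirectories inside a directory, so there are three levels: directory --> subdirectories --> multiple PDF files in each subdirectory. I am using the code below, which gives me this error: ValueError: too many values to unpack (expected 3). The code works when I convert files in a single directory rather than in multiple subdirectories.

This is probably simple, but I can't figure it out. Any help would be much appreciated. Thanks.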

import os
import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdf_files")

for pdf_path, dirs, files in pdfs:
    for file in files:
        pages = convert_from_path(os.path.join(pdf_path, file), 500)

        for pageNum,imgBlob in enumerate(pages):
            text = pytesseract.image_to_string(imgBlob,lang='eng')

            with open(f'{pdf_path}.txt', 'a') as the_file:
                the_file.write(text)

Answer 1

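As mentioned in the comments, you need os.walk, not glob.glob. os.walk walks the directory tree recursively and yields one (pdf_path, dirs, files) tuple per directory: pdf_path is the directory it is currently listing, dirs is the list of subdirectories in it, and files is the list of files in that folder.

Use os.path.join() to build the full path from the parent folder and the file name.

Also, rather than repeatedly appending to the txt file, just open it once, outside the "page to text" loop.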

import os

import pytesseract
from pdf2image import convert_from_path

pdfs_dir = r"K:\pdf_files"

for pdf_path, dirs, files in os.walk(pdfs_dir):
    for file in files:
        if not file.lower().endswith('.pdf'):
            # skip non-PDF files
            continue

        file_path = os.path.join(pdf_path, file)
        # render every page of the PDF to an image at 500 DPI
        pages = convert_from_path(file_path, 500)

        # swap the extension for .txt; splitext only touches the final
        # extension and also copes with an upper-case ".PDF"
        txt_path = os.path.splitext(file_path)[0] + '.txt'
        with open(txt_path, 'w') as the_file:  # write mode: opened once per PDF
            for pageNum, imgBlob in enumerate(pages):
                text = pytesseract.image_to_string(imgBlob, lang='eng')
                the_file.write(text)
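
To make the difference concrete, here is a small sketch that reproduces the unpacking error and shows what os.walk yields instead. It runs on a throwaway directory tree; the folder and file names (sub_a, a.pdf and so on) and the helper functions are made up for illustration and are not part of the original code.

```python
import glob
import os
import tempfile


def build_demo_tree(root):
    """Create root/sub_a/a.pdf and root/sub_b/b.pdf as empty placeholder files."""
    for sub, name in (("sub_a", "a.pdf"), ("sub_b", "b.pdf")):
        os.makedirs(os.path.join(root, sub))
        open(os.path.join(root, sub, name), "w").close()


def list_pdfs(root):
    """Collect every .pdf path under root using os.walk."""
    found = []
    for parent, dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(parent, name))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    build_demo_tree(root)

    # Without wildcards, glob.glob returns a list holding the path itself
    # (one plain string), not (parent, dirs, files) tuples.
    matches = glob.glob(root)

    try:
        # Unpacking a string into three names splits it character by
        # character, which fails for any path longer than three characters.
        for parent, dirs, files in matches:
            pass
        unpack_error = None
    except ValueError as exc:
        unpack_error = str(exc)

    pdf_names = [os.path.basename(p) for p in list_pdfs(root)]

print(unpack_error)
print(pdf_names)
```

The first print shows the same kind of ValueError as in the question, and the second lists the PDFs found in both subdirectories.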

Answer 2

I just solved this in a simpler way, by adding * to match every subdirectory in the directory:

import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdf_files\*\*.pdf")

for pdf_path in pdfs:
    pages = convert_from_path(pdf_path, 500)

    for pageNum,imgBlob in enumerate(pages):
        text = pytesseract.image_to_string(imgBlob,lang='eng')

        with open(f'{pdf_path}.txt', 'a') as the_file:
            the_file.write(text)
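
Note that the `*\*.pdf` pattern only matches PDFs exactly one level below the root. If the nesting depth can vary, one option is glob's `**` wildcard together with recursive=True. The sketch below demonstrates this on a temporary tree; the file names and the find_pdfs and txt_path_for helpers are hypothetical, and txt_path_for is only a suggestion for getting report.txt rather than report.pdf.txt as the output name.

```python
import glob
import os
import tempfile


def find_pdfs(root):
    """Return every .pdf under root, at any depth, in sorted order."""
    pattern = os.path.join(root, "**", "*.pdf")
    # With recursive=True, "**" matches zero or more directory levels.
    return sorted(glob.glob(pattern, recursive=True))


def txt_path_for(pdf_path):
    """Replace the final extension of pdf_path with .txt."""
    base, _ext = os.path.splitext(pdf_path)
    return base + ".txt"


with tempfile.TemporaryDirectory() as root:
    demo_files = (
        "top.pdf",
        os.path.join("a", "one.pdf"),
        os.path.join("a", "b", "two.pdf"),
        os.path.join("a", "notes.docx"),  # should be ignored
    )
    for rel in demo_files:
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    found = [os.path.relpath(p, root) for p in find_pdfs(root)]

for rel in found:
    print(rel, "->", txt_path_for(rel))
```

In the real script, each path returned by find_pdfs would be passed to convert_from_path, and the text file would be opened once per PDF at txt_path_for(pdf_path).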

Answer 1
2 2021-04-08 00:02:17

Answer 2
1 Accepted 2021-04-08 01:55:00
