Extracting PDF text to a text file with Python - extraction errors

Question

I want to start by extracting all the text from a single PDF file and storing it in a text file.

Here is my code:

import PyPDF2
from pathlib import Path

with Path('C:/Users/Lui/Desktop/Test/file1.pdf').open(mode='rb') as pdf_file, open('Extracted/extractPDF.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)
    for page_number in range(number_of_pages):   # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        print(page_content)
        text_file.write(page_content)

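The PDF looks like this:

However, the text file that gets created looks different, with missing words and wrong spacing:

What am I doing wrong? My goal is to loop over 1,000 PDFs afterwards, so I'm trying to get one example working first.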

Answer 1

Try using pdftotext:

import pdftotext

# Path to the PDF to read (set this to your own file)
filename = "C:/Users/Lui/Desktop/Test/file1.pdf"

# Load your PDF
with open(filename, "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
#with open("secure.pdf", "rb") as f:
#    pdf = pdftotext.PDF(f, "secret")

# How many pages?
#print(len(pdf))

# Iterate over all the pages
#for page in pdf:
#    print(page)

# Read all the text into one string
data = "\n\n".join(pdf)
print(data)

This package works better and should help you.
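
Since the goal is to process about 1,000 PDFs, the same idea can be wrapped in a loop. The following is only a rough sketch, not a tested pipeline: the names `convert_folder` and `extract_with_pdftotext` are made up for illustration, and it assumes that all PDFs sit in one folder and that one `.txt` file per PDF is wanted. The extractor is passed in as a function, so a different library can be swapped in if pdftotext does not suit your files.

```python
from pathlib import Path


def extract_with_pdftotext(pdf_path):
    """Return a list with the text of each page, using the pdftotext package."""
    import pdftotext  # imported here so the rest of the sketch loads without it

    with pdf_path.open("rb") as f:
        return list(pdftotext.PDF(f))


def convert_folder(pdf_dir, out_dir, extract=extract_with_pdftotext):
    """Write one .txt file per *.pdf in pdf_dir (hypothetical helper).

    Returns (written, failed): the text files created, and
    (pdf_path, exception) pairs for PDFs that could not be read.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        try:
            pages = extract(pdf_path)
        except Exception as exc:
            # One unreadable PDF should not stop the whole batch.
            failed.append((pdf_path, exc))
            continue
        out_path = out_dir / (pdf_path.stem + ".txt")
        # Explicit UTF-8 avoids encoding errors with the platform default.
        out_path.write_text("\n\n".join(pages), encoding="utf-8")
        written.append(out_path)
    return written, failed


# Example call, using the folder names from the question:
# written, failed = convert_folder("C:/Users/Lui/Desktop/Test", "Extracted")
```

Collecting failures in a list instead of raising makes it possible to inspect the problem files after the run and retry them separately.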

Solution 1
1 accepted 2021-10-06 05:39:07
