[英]Extracting PDF text into a text file using Python - Extraction error
我想首先從 1 pdf 文件中提取所有文本並將其存儲到一個文本文件中。
這是我的代碼:
import PyPDF2
from pathlib import Path
with Path('C:/Users/Lui/Desktop/Test/file1.pdf').open(mode='rb') as pdf_file, open('Extracted/extractPDF.txt', 'w') as text_file:
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
print(number_of_pages)
for page_number in range(number_of_pages): # use xrange in Py2
page = read_pdf.getPage(page_number)
page_content = page.extractText()
print(page_content)
text_file.write(page_content)
但是,與缺少的單詞和間距相比,創建的文本文件看起來有所不同:
我究竟做錯了什么? 我的目標是然后遍歷 1,000 個 PDF,所以我試圖讓 1 個示例首先工作。
嘗試使用pdftotext
import pdftotext
# Load your PDF
with open(filename, "rb") as f:
pdf = pdftotext.PDF(f)
# If it's password-protected
#with open("secure.pdf", "rb") as f:
# pdf = pdftotext.PDF(f, "secret")
# How many pages?
#print(len(pdf))
# Iterate over all the pages
#for page in pdf:
# print(page)
data = "\n\n".join(pdf)
# Read all the text into one string
print(data)
這個 package 工作得更好,應該可以幫助你。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.