My code extracts text from PDF files and compares the information. It seems to fail when run on large PDFs

Question

I can use my code to compare smaller PDFs, but when it is run on larger PDFs it fails with various error messages. Here is my code:


import pdfminer
import pandas as pd
from time import sleep
from tqdm import tqdm
from itertools import chain
import slate



# List of pdf files to process
pdf_files = ['file1.pdf', 'file2.pdf']

# Create a list to store the text from each PDF
pdf1_text = []
pdf2_text = []

# Iterate through each pdf file
for pdf_file in tqdm(pdf_files):
    # Open the pdf file
    with open(pdf_file, 'rb') as pdf_now:
        # Extract text using slate
        text = slate.PDF(pdf_now)
        text = text[0].split('\n')
        if pdf_file == pdf_files[0]:    
            pdf1_text.append(text)
        else:
            pdf2_text.append(text)

    sleep(20)

pdf1_text = list(chain.from_iterable(pdf1_text))
pdf2_text = list(chain.from_iterable(pdf2_text))

differences = set(pdf1_text).symmetric_difference(pdf2_text)

## Create a new dataframe to hold the differences
differences_df = pd.DataFrame(columns=['pdf1_text', 'pdf2_text'])

# Iterate through the differences and add them to the dataframe
for difference in differences:
    # Create a new row in the dataframe with the difference from pdf1 and pdf2
    differences_df = differences_df.append({'pdf1_text': difference if difference in pdf1_text else '',
                                            'pdf2_text': difference if difference in pdf2_text else ''}, ignore_index=True)

# Write the dataframe to an excel sheet
differences_df = differences_df.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)

differences_df.to_excel('differences.xlsx', index=False, engine='openpyxl')


import openpyxl

import re

# Load the Excel file into a dataframe
df = pd.read_excel("differences.xlsx")

# Create a condition to check the number of words in each cell
for column in ["pdf1_text", "pdf2_text"]:
    df[f"{column}_word_count"] = df[column].str.split().str.len()
    condition = df[f"{column}_word_count"] < 10
    # Drop the rows that meet the condition
    df = df[~condition]

for column in ["pdf1_text", "pdf2_text"]:
    df = df.drop(f"{column}_word_count", axis=1)


# Save the modified dataframe to a new Excel file
df.to_excel("differences.xlsx", index=False)

The last error I got was the one below. Could anyone please go through the code and help me find what the actual problem is?

TypeError: %d format: a real number is required, not bytes

Answer 1

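If you really want to speed up your script by at least an order of magnitude, I suggest using PyMuPDF instead of PyPDF2 or pdfminer. I typically measure durations that are 10 to 35 times shorter. And of course, drop the time.sleep() call: why would you artificially slow down the processing?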
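To check a speed difference like this on your own files, one option is to time each extraction function with a small helper. This is only a minimal sketch: `timed` and `fake_extract` are hypothetical names made up for illustration, and `fake_extract` is a stand-in that you would replace with your real slate, pdfminer, or PyMuPDF extraction function.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for a real extraction function (illustration only)
def fake_extract(path):
    return f"text of {path}"


text, seconds = timed(fake_extract, "file1.pdf")
print(f"extracted {len(text)} characters in {seconds:.4f} s")
```

Timing the same file with each library, without any `sleep()` in the loop, should show whether the difference matters for your PDFs.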

Here is how to read the text lines of two PDFs with PyMuPDF:

import fitz  # PyMuPDF

doc1 = fitz.open("file1.pdf")
doc2 = fitz.open("file2.pdf")

text1 = "\n".join([page.get_text() for page in doc1])
text2 = "\n".join([page.get_text() for page in doc2])

lines1 = text1.splitlines()
lines2 = text2.splitlines()

# then do your comparison ...
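The comparison step itself could look something like the following. This is a sketch under assumptions, not the only way to do it: `diff_lines` and `min_words` are hypothetical names, and the logic mirrors the question's approach (lines that appear in only one document, optionally keeping only lines with at least a given number of words). It uses the standard library only, so it takes plain lists such as `lines1` and `lines2`.

```python
def diff_lines(lines1, lines2, min_words=0):
    """Return (pdf1_text, pdf2_text) rows for lines found in only one input."""
    set1, set2 = set(lines1), set(lines2)
    rows = []
    # Lines only in the first document, in original order, duplicates removed
    for line in dict.fromkeys(lines1):
        if line not in set2 and len(line.split()) >= min_words:
            rows.append((line, ""))
    # Lines only in the second document
    for line in dict.fromkeys(lines2):
        if line not in set1 and len(line.split()) >= min_words:
            rows.append(("", line))
    return rows


rows = diff_lines(["a b c", "same line"], ["same line", "x y"])
print(rows)  # [('a b c', ''), ('', 'x y')]
```

The resulting rows can be handed to pandas in one step, for example `pd.DataFrame(rows, columns=['pdf1_text', 'pdf2_text'])`. Building the frame once avoids the row-by-row `DataFrame.append` used in the question, which was removed in pandas 2.0, and passing `min_words=10` replaces the separate pass that reloads the Excel file to drop short lines.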

Solution 1
0 2023-02-01 10:20:46