Unable to read a PDF document with PyPDF2

Question

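I am trying to read some text from a PDF file. I am using the code below, but when I try to get the text (ptext), all that comes back is a string variable of size 1, and it is empty.

Why is no text returned? I have tried other pages and another PDF book, but the same thing happens: I can't seem to read any text.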

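import PyPDF2

# Open the PDF in binary mode and build a reader (legacy PyPDF2 API, before 3.0)
file = open(r'C:/Users/pdfs/test_file.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)

# Page indices are zero-based, so 445 is the 446th page
pageObj = fileReader.getPage(445)
ptext = pageObj.extractText()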

Answer 1

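I had the same problem and assumed something was wrong with my code. After a fair amount of research, debugging, and investigation, it seems that the PyPDF2, PyPDF3, and PyPDF4 packages cannot handle large files. I tried a 20-page PDF and it ran seamlessly, but with a PDF of 50 or more pages, PyPDF crashed.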

My only suggestion is to use a different package altogether. pdftotext is a good choice. Install it with pip install pdftotext.
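A minimal sketch of that suggestion, assuming pdftotext and its Poppler system dependency are installed. The names extract_pages and page_text are hypothetical helpers for illustration; pdftotext.PDF is the package's reader class, and iterating over it yields one string per page.

```python
def extract_pages(path):
    """Return the text of every page in the PDF at `path` as a list of strings."""
    import pdftotext  # third-party; imported here so the helper below works without it

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return list(pdf)  # one string per page


def page_text(pages, page_number):
    """Return the text for a 1-based page number from a list of page strings."""
    if not 1 <= page_number <= len(pages):
        raise IndexError(f"page {page_number} is outside 1..{len(pages)}")
    return pages[page_number - 1]


if __name__ == "__main__":
    # Path taken from the question; adjust it for your own file.
    pages = extract_pages(r"C:/Users/pdfs/test_file.pdf")
    print(page_text(pages, 446))
```

Note that page_text takes a 1-based page number, whereas getPage(445) in the question is zero-based, so both refer to the same page.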

Answer 2

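I ran into a similar problem when reading my PDF file. I hope the following solution helps. The reason in my case: the PDF I had chosen was actually a scanned image. I had created my resume on a third-party website, which returned it as a PDF. When parsing this type of file, I could not extract the text directly.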
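One way to check whether a PDF is a scan is to see whether any page yields non-whitespace text. The sketch below assumes the pypdf package (the maintained successor to PyPDF2) is installed; has_text_layer and extract_page_texts are hypothetical helper names, and sampling only the first five pages is an arbitrary choice.

```python
from itertools import islice


def has_text_layer(page_texts, min_chars=1):
    """Return True if any page text has at least `min_chars` non-whitespace characters."""
    return any(len("".join(t.split())) >= min_chars for t in page_texts if t)


def extract_page_texts(path, max_pages=5):
    """Extract text from the first `max_pages` pages of the PDF at `path`."""
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages
    return [page.extract_text() or "" for page in islice(reader.pages, max_pages)]


if __name__ == "__main__":
    texts = extract_page_texts(r"C:/Users/pdfs/test_file.pdf")
    if has_text_layer(texts):
        print("Text layer found; direct extraction should work.")
    else:
        print("No text found; the pages are probably scanned images and need OCR.")
```

If no sampled page returns any text, the file is likely image-only and OCR is the remaining option.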

Below is the tested, working code:

from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os


def readPdfFile(filePath):
    # Part 1: convert each PDF page to a temporary JPEG image (rendered at 500 DPI)
    pages = convert_from_path(filePath, 500)
    filenames = []
    for image_counter, page in enumerate(pages, start=1):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(filename, 'JPEG')
        filenames.append(filename)

    # Part 2: recognize text from the images using OCR
    text = ""
    for filename in filenames:
        with Image.open(filename) as image:
            page_text = pytesseract.image_to_string(image)
        # Rejoin words hyphenated across line breaks, and accumulate
        # every page instead of keeping only the last one
        text += page_text.replace('-\n', '')

    # Part 3: remove the temporary image files
    for filename in filenames:
        os.remove(filename)
    return text
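
pytesseract.image_to_string also accepts PIL image objects directly, and convert_from_path returns a list of them, so the temporary JPEG files can be skipped. A rough sketch of that variation follows; ocr_images and read_scanned_pdf are hypothetical names, and the ocr parameter is an assumption added only so the function can be exercised without Tesseract installed.

```python
def ocr_images(images, ocr=None):
    """Run OCR on already-loaded page images and return the combined text."""
    if ocr is None:
        import pytesseract  # third-party; needs the Tesseract binary

        ocr = pytesseract.image_to_string
    parts = []
    for image in images:
        page_text = ocr(image)
        # Rejoin words hyphenated across line breaks
        parts.append(page_text.replace("-\n", ""))
    return "\n".join(parts)


def read_scanned_pdf(path, dpi=500):
    """Render each page of the PDF in memory and OCR it, with no temp files."""
    from pdf2image import convert_from_path  # third-party; needs Poppler

    return ocr_images(convert_from_path(path, dpi))


if __name__ == "__main__":
    print(read_scanned_pdf(r"C:/Users/pdfs/test_file.pdf"))
```

Rendering at 500 DPI keeps every page image in memory at once, so for long documents a lower DPI or page-by-page conversion may be preferable.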

Answer 1
0 2020-10-07 11:27:03

Answer 2
0 2021-07-21 18:17:49
