Reading a PDF file line by line using Python

Question

I am using the following code to read a PDF file, but it does not read it. What could be the reason?

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)

The output is [u''] instead of the file's contents.

Answer 1

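import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")

for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # Split the page text into lines; looping over the string itself
    # would yield single characters instead of lines
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)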

I use this to iterate over a PDF page by page, search for key terms in it, and process the matches further.
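If you also need to know where each match was found, the search step can be separated from the PDF library. The following is only a minimal sketch: find_matching_lines is a hypothetical helper name, and it assumes you have already extracted one text string per page.

```python
import re


def find_matching_lines(page_texts, pattern):
    """Yield (page_number, line_number, line) for every line matching pattern.

    page_texts is any iterable of strings, one per page. Page and line
    numbers are 1-based. Matching is case-insensitive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for page_number, text in enumerate(page_texts, start=1):
        for line_number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield page_number, line_number, line


# Plain strings stand in for the text extracted from each page
sample_pages = ["Intro\nabc appears here", "Nothing relevant", "ABC again\nend"]
for hit in find_matching_lines(sample_pages, "abc"):
    print(hit)
```

With the reader from the code above, the per-page strings could be supplied as (page.extractText() for page in reader.pages).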

Answer 2

Maybe this can help you read the PDF.

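import pyPdf

def getPDFContent(path):
    content = ""
    # open() replaces the Python 2-only file() builtin
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Loop over the actual page count rather than a hard-coded 10
        for i in range(pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content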

Answer 3

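I think you need to specify the drive letter, which is missing from your path. For example, "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried it and could read the file without any problem.

Alternatively, if you want to locate the file with the os module, without depending on your current directory, you can try the following: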

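from PyPDF2 import PdfFileReader
import os

def find(name, path):
    # Walk the tree under path and return the first file called name
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

# find() returns None if the file is not found under the search root
pdf_path = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

# Open in binary mode; the with block closes the file afterwards
with open(pdf_path, 'rb') as f:
    reader = PdfFileReader(f)
    contents = reader.getPage(0).extractText().split('\n')

print(contents)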

The find function comes from Nadia Alramli's answer to "Find a file in python".
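Since a wrong path is the suspected cause here, it may help to fail early with a readable message instead of passing a bad path to open(). This is only a sketch: locate_pdf is a hypothetical wrapper around the same find function, not part of any library.

```python
import os


def find(name, path):
    """Return the full path of the first file called name under path, else None."""
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
    return None


def locate_pdf(name, search_root):
    """Hypothetical wrapper: raise a clear error when the lookup fails."""
    if not os.path.isdir(search_root):
        raise FileNotFoundError(
            "search root does not exist: " + os.path.abspath(search_root)
        )
    found = find(name, search_root)
    if found is None:
        raise FileNotFoundError(
            name + " not found under " + os.path.abspath(search_root)
        )
    return found
```

The returned path can then be opened in binary mode and handed to the reader as in the code above.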

Answer 4

To read files from multiple folders under a directory, you can use the following code. This example reads PDF files:

import os
from tika import parser

directory = "/usr/local/"  # root directory to search
for r, d, f in os.walk(directory):  # walk through all subdirectories
    for file in f:
        # Keep only names ending in .pdf; a plain substring test
        # would also match names such as "notes.pdf.bak"
        if file.lower().endswith(".pdf"):
            file_join = os.path.join(r, file)        # build the full path
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']              # read the content
            print(text)                              # print the content
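If you want to reuse the directory walk with a different parser, the file discovery can be split out on its own. This is a rough sketch using only the standard library; iter_pdf_paths is a hypothetical helper name.

```python
import os


def iter_pdf_paths(root):
    """Yield the full path of every file under root whose name ends in .pdf.

    The suffix check is case-insensitive, so "REPORT.PDF" is included,
    while a name such as "notes.pdf.bak" is skipped.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.lower().endswith(".pdf"):
                yield os.path.join(dirpath, filename)
```

Each yielded path can be passed to parser.from_file as in the code above.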

Answer 5

def getTextPDF(pdfFileName, password=''):
    """Extract text from a PDF and return it as a list of sentences."""
    import PyPDF2
    from nltk import sent_tokenize

    text = []
    # Open in binary mode and close the file once all pages are read
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Join the pages, drop every newline, then split into sentences
    text = '\n'.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text

Answer 6

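The problem is one of two things: (1) the text is not on the first page, which is a user error; or (2) PyPDF2 cannot extract the text, which is a bug in PyPDF2.

Sadly, the second one still happens with some PDFs.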
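To tell the two cases apart, you can check every page rather than only the first. The following is a minimal sketch: summarize_extraction and diagnose are hypothetical helper names, and they assume you pass in one extracted text string per page.

```python
def summarize_extraction(page_texts):
    """Return (pages_with_text, empty_pages) as lists of 0-based page indexes."""
    pages_with_text, empty_pages = [], []
    for index, text in enumerate(page_texts):
        # Treat None, "" and whitespace-only strings as empty
        if text and text.strip():
            pages_with_text.append(index)
        else:
            empty_pages.append(index)
    return pages_with_text, empty_pages


def diagnose(page_texts):
    """Map the per-page result onto the two causes described above."""
    with_text, empty = summarize_extraction(page_texts)
    if not with_text:
        # Nothing came out of any page: points to cause (2)
        return "no text extracted from any page"
    if 0 in empty:
        # Cause (1): the text exists, just not on the first page
        return "first page is empty; text starts on page index %d" % with_text[0]
    return "first page has text"
```

For the code in the question, the input could be [page.extractText() for page in reader.pages].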

Answer 7

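Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install it first and then use the module.

Installation steps on Ubuntu (installing python-pypdf)

1. Open a terminal.
2. Type sudo apt-get install python-pypdf

Solution to your problem

Try the code below: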

# Import the library
import PyPDF2

# Name of the file to read, including the ".pdf" extension; open it in binary mode
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Index of the page to read. getPage() is zero-based, so 0 is the first page
# and number_of_pages - 1 is the last one.
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)
pdf_file.close()

Download the PDF from the following link and try this code: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer helps.
If you have any questions, please leave a comment.

Answer 1
4 2018-01-23 12:47:56

Answer 2
0 2017-07-08 04:16:20

Answer 3
0 2017-10-03 17:04:54

Answer 4
0 2019-12-21 11:37:31

Answer 5
0 2021-01-27 09:46:18

Answer 6
0 2022-05-14 11:59:36

Answer 7
-2 2017-07-08 04:35:01
