Reading PDF files line by line using Python

I used the following code to read a PDF file, but it does not return any text. What could be the reason?

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)

The output is [u''] instead of the page content.

import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")

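for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # splitlines() yields whole lines; iterating over the string directly
    # would yield one character at a time
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)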

I use this to iterate over a PDF page by page, search each page for key terms, and process the matches further.
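A minimal sketch of the same page-by-page search, assuming a recent PyPDF2 release in which the reader class is PdfReader and the extraction method is extract_text() (older releases use PdfFileReader and extractText(), as in the code above). The helper names find_matching_lines and search_pdf are made up for illustration and are not part of PyPDF2.

```python
import re


def find_matching_lines(page_text, pattern):
    """Return the lines of page_text that match pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    # splitlines() yields whole lines; iterating over the string itself
    # would yield single characters instead.
    return [line for line in page_text.splitlines() if regex.search(line)]


def search_pdf(path, pattern):
    """Yield (page_number, line) pairs for every matching line in the PDF."""
    # Assumption: a recent PyPDF2 release is installed. The import is kept
    # inside the function so the text helper above works without it.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    for page_number, page in enumerate(reader.pages, start=1):
        # Treat a page with no extractable text as an empty string
        text = page.extract_text() or ""
        for line in find_matching_lines(text, pattern):
            yield page_number, line
```

Looping over search_pdf("example.pdf", "abc") would then give each hit together with its page number, which is handy when the matches need further processing.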

Maybe this can help you read the PDF.

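import pyPdf

def getPDFContent(path):
    content = ""
    # open() replaces the Python 2-only file() builtin
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Loop over the real page count instead of a hard-coded 10
        for i in range(pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content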

I think you need to specify the drive letter; it is missing from your path. For example: "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried it and could read the file without any problem.

Alternatively, if you want to locate the file with the os module, which your code did not actually use to build the path, you can try the following:

from PyPDF2 import PdfFileReader
import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

f = open(directory, 'rb')

reader = PdfFileReader(f)

contents = reader.getPage(0).extractText().split('\n')

f.close()

print(contents)

The find function comes from Nadia Alramli's answer to "Find a file in python".
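One thing to watch for: find returns None when no file with that name exists under the search root, and passing None to open() then fails with a confusing error. A small sketch of a guarded variant follows; the wrapper name open_found_pdf is hypothetical and only for illustration.

```python
import os


def find(name, path):
    """Return the full path of the first file called name under path, or None."""
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
    return None


def open_found_pdf(name, path):
    """Open the located file in binary mode, or raise a clear error."""
    found = find(name, path)
    if found is None:
        raise FileNotFoundError("%s not found under %s" % (name, path))
    return open(found, "rb")
```

The caller is still responsible for closing the returned file object, for example by using it in a with statement.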

To read files from multiple folders under a directory, the code below can be used. This example reads PDF files:

import os
from tika import parser

path = "/usr/local/"  # root directory to search
for r, d, f in os.walk(path):  # walk through all subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # process only PDF files
            file_join = os.path.join(r, file)  # build the full path
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']  # extracted text content
            print(text)  # print the content

def getTextPDF(pdfFileName, password=''):
    """Extract text from a PDF and return it as a list of sentences."""
    import PyPDF2
    from nltk import sent_tokenize

    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Drop line breaks so that sentences are not cut at line ends
    text = '\n'.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text

The issue was one of two things: (1) the text was not on page one, hence a user error; or (2) PyPDF2 failed to extract the text, hence a bug in PyPDF2.

Sadly, the second one still happens for some PDFs.
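To tell the two cases apart, it can help to extract every page and see which ones yield any text at all. The sketch below assumes a recent PyPDF2 release (PdfReader, extract_text()); the function names pages_with_text and diagnose are hypothetical.

```python
def pages_with_text(page_texts):
    """Return zero-based indices of pages whose extracted text is not blank."""
    return [i for i, text in enumerate(page_texts) if text and text.strip()]


def diagnose(path):
    """Print which pages of the PDF produced text."""
    # Assumption: a recent PyPDF2 release is installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    texts = [page.extract_text() for page in reader.pages]
    found = pages_with_text(texts)
    if not found:
        # Either extraction failed or the PDF has no text layer at all
        print("No text extracted from any page.")
    elif 0 not in found:
        print("Page one is empty; text found on pages (zero-based):", found)
    else:
        print("Text found on pages (zero-based):", found)
```

If no page produces text, the file may simply have no text layer (a scanned document, for example), in which case no text extractor will help without OCR.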

Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install it first and then use the module.

Installation Steps for Ubuntu (Install python-pypdf)

  1. First, open a terminal
  2. Then type sudo apt-get install python-pypdf

Solution to Your Problem

Try the code below:

# Import the library
import PyPDF2

# Name of the file to read, including the ".pdf" extension
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')  # PDFs must be opened in binary mode
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Index of the page to read; getPage() counts from 0, so 0 is the first page
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)

Download the PDF from the link below and try this code: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer is helpful.
If you have any questions, please leave a comment.
