简体   繁体   中英

How to read pdf files one by one from a folder in python

I am reading pdf files and trying to extract keywords from them through NLP techniques.Right now the program accepts one pdf at a time. I have a folder say in D drive named 'pdf_docs'. The folder contains many pdf documents. My goal is to read each pdf file one by one from the folder. How can I do that in python. The code so far working successfully is like below.

import PyPDF2

file = open('abc.pdf','rb')


fileReader = PyPDF2.PdfFileReader(file)

count = 0

while count < 3:

    pageObj = fileReader.getPage(count)
    count +=1
    text = pageObj.extractText()

you can use glob in order use pattern matching for getting a list of all pdf files in your directory.

import glob

pdf_dir = "/foo/dir"

pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    do_your_stuff()

First read all files that are available under that directory

from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

And then run your code for each file in that list

import PyPDF2
from os import listdir
from os.path import isfile, join


onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    fileReader = PyPDF2.PdfFileReader(open(file,'rb'))

    count = 0

    while count < 3:

        pageObj = fileReader.getPage(count)
        count +=1
        text = pageObj.extractText()

os.listdir() will get you everything that's in a directory - files and directories. So be careful to have only pdf files in your path or you will need to implement simple filtration for list.

Edit 1

You can also use glob module, as it does pattern matching.

>>> import glob
>>> print(glob.glob('/home/rszamszur/*.sh'))
['/home/rszamszur/work-monitors.sh', '/home/rszamszur/default-monitor.sh', '/home/rszamszur/home-monitors.sh']

Key difference between OS module and glob is that OS will work for all systems, where glob only for Unix like.

import PyPDF2
import re
import glob

#your full path of directory
mypath = "dir"
for file in glob.glob(mypath + "/*.pdf"):
    print(file)
    if file.endswith('.pdf'):
        fileReader = PyPDF2.PdfFileReader(open(file, "rb"))
        count = 0
        count = fileReader.numPages
        while count >= 0:
            count -= 1
            pageObj = fileReader.getPage(count)
            text = pageObj.extractText()
            print(text)
        num = re.findall(r'[0-9]+', text)
        print(num)
    else:
        print("not in format")

Let's go through the code: In python we can't handle Pdf files normally. so we need to install PyPDF2 package then import the package. "glob" function is used to read the files inside the directory. using "for" loop to get the files inside the folder. now check the file type is it in pdf format or not by using "if" condition. now we are reading the pdf files in the folder using "PdfFileReader"function. then getting number of pages in the pdf document. By using while loop to getting all pages and print all text in the file.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM