
Python script to iterate through PDFs in a directory and find a matching line

Currently I get all my reports delivered to me by email, attached as PDFs. I have set Outlook to automatically download those files to a certain directory every day. Sometimes those PDFs don't have any data in them and only contain the line "There is no data to present that matches the selection criteria". I would like to create a Python program that iterates through every PDF file in that directory, opens it, and looks for that phrase. If a file contains the phrase, the program should delete that PDF; if not, it should do nothing. With help from Reddit I have pieced together the code below:

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open("{}/{}".format(directory,file), 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)

I have tested with 3 files, one of which contains the matching phrase. No matter how the files are named or what order they are in, it fails. I have also tested it with a single file in the directory, named 3.pdf. Below is the error I get.

FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'

This would reduce my workload dramatically and be a great learning example for me as a newbie. All help and criticism are welcome.

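os.listdir returns bare file names such as '3.pdf', not full paths. Your open call adds the directory, but os.remove(file) does not, so Python looks for 3.pdf in the current working directory and raises FileNotFoundError. Build the full path once with os.path.join and use it for both calls. Also move the delete outside the with block, since Windows will not remove a file that is still open. See below:

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
phrase = "There is no data to present that matches the selection criteria"
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    path = os.path.join(directory, file)  # Changes here: full path, not just the file name
    with open(path, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        no_data = phrase in pageObj.extractText()
    # Delete only after the with block has closed the file.
    if no_data:
        os.remove(path)  # Changes here: remove by full path
        print("{} was removed.".format(file))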
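
If you want something you can test without real reports, here is a rough sketch of the same loop as a function, using pathlib so every path already includes the directory. The names remove_empty_reports and extract_text are my own and not part of PyPDF2. extract_text is a hypothetical hook: any callable that takes a path, returns the text of the document, and closes the file before it returns.

```python
from pathlib import Path
from typing import Callable, List

PHRASE = "There is no data to present that matches the selection criteria"


def remove_empty_reports(directory: str,
                         extract_text: Callable[[Path], str],
                         phrase: str = PHRASE) -> List[str]:
    """Delete every .pdf in `directory` whose extracted text contains `phrase`.

    `extract_text` is a hypothetical hook supplied by the caller; it must
    close the file before returning. Returns the names of the deleted files.
    """
    removed = []
    # Path.glob yields paths that include the directory, so there is no
    # bare file name to resolve against the wrong working directory.
    # sorted() also materialises the listing before anything is deleted.
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        if phrase in extract_text(pdf_path):
            pdf_path.unlink()  # the extractor has already closed the file
            removed.append(pdf_path.name)
    return removed
```

In your case extract_text could wrap the PdfFileReader calls from the code above inside a with block, so the file is closed by the time unlink runs. Passing the extractor in is only a convenience for testing with plain text files; it is not required for the fix itself.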

