
Python script to iterate through PDFs in a directory and find a matching line

Currently I get all my reports delivered by email as PDF attachments. I have set Outlook to automatically download those files to a certain directory every day. Sometimes those PDFs don't have any data in them and only contain the line "There is no data to present that matches the selection criteria". I would like to create a Python program that iterates through every PDF file in that directory, opens it, and looks for that phrase; if a file contains it, the program should delete that PDF, and otherwise do nothing. With help from Reddit I have pieced together the code below:

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open("{}/{}".format(directory,file), 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)

I have tested with three files, one of which contains the matching phrase. No matter how the files are named or what order they are in, it fails. I have also tested it with a single file in the directory, named 3.pdf. Below is the error I get.

FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'

This would reduce my workload dramatically and be a great learning example for me as a newbie. All help/criticism welcome.

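os.listdir() returns bare file names, not full paths, so os.remove(file) looks for 3.pdf in the current working directory; that is where the FileNotFoundError comes from. Build the full path once with os.path.join and use it for both open() and os.remove(). The file is also removed only after the with block has closed it, since Windows will not delete a file that is still open. See below:

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
phrase = "There is no data to present that matches the selection criteria"
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    path = os.path.join(directory, file)  # Changes here: full path, reused for open() and os.remove()
    with open(path, 'rb') as pdfFileObj:
        # Legacy PyPDF2 names; PyPDF2 3.x renamed them to PdfReader, reader.pages[0] and extract_text()
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        is_empty = phrase in pageObj.extractText()
    # Delete only after the with block has closed the file
    if is_empty:
        os.remove(path)
        print("{} was removed.".format(file))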
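
Since the script deletes files, it may be worth being able to check what it would remove before letting it do so. Below is a minimal sketch of the same loop wrapped in a function, not a definitive implementation. The names find_empty_reports, first_page_text and the delete flag are made up for this example. It also collapses whitespace before matching, on the assumption that the extracted text may contain line breaks in the middle of the phrase.

```python
from pathlib import Path

PHRASE = "There is no data to present that matches the selection criteria"


def first_page_text(path):
    """Return the text of the first page, using the same PyPDF2 calls as above."""
    import PyPDF2  # imported here so the scanning logic can be tried without PyPDF2

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return reader.getPage(0).extractText()


def find_empty_reports(directory, extract_text=first_page_text,
                       phrase=PHRASE, delete=False):
    """Return the names of the .pdf files whose extracted text contains phrase.

    Files are only deleted when delete=True; by default this is a dry run.
    """
    matches = []
    for path in sorted(Path(directory).glob("*.pdf")):
        # Collapse runs of whitespace so a line break inside the phrase
        # does not hide the match.
        text = " ".join(extract_text(path).split())
        if phrase in text:
            if delete:
                # extract_text has already closed the file, so this is safe on Windows.
                path.unlink()
            matches.append(path.name)
    return matches
```

Calling find_empty_reports(directory) only lists the matching files; calling it again with delete=True removes them. Because each Path yielded by glob already includes the directory, the bare-filename problem from the question does not come up.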
