Deleting PDF files from a folder if a search phrase is present, using Python

Hi, I am trying to delete the PDF files in a folder that contain the words "Publications périodiques" on the first page. So far I am able to search for the words, but I don't know how to delete the files.

Code used to search for the words in a PDF file:

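import PyPDF2
import re

# Open the PDF with the legacy PyPDF2 reader
object = PyPDF2.PdfFileReader("202105192101394-60.pdf")
NumPages = object.getNumPages()  # page count; the loop below needs it
String = "Publications périodiques"
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i))
    Text = PageObj.extractText()
    # print(Text)
    # re.search returns a match object, or None if the phrase is absent
    ResSearch = re.search(String, Text)
    print(ResSearch)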

Also, how can I loop this over multiple files?

You can delete any file using:

import os
os.remove("C://fake/path/to/file.pdf")

To delete a file, use:

import os
os.unlink(file_path)

where file_path is the path to the relevant file. os.unlink() and os.remove() do the same thing.
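Either call raises FileNotFoundError when the path does not exist, which starts to matter once deletion runs inside a loop. A minimal sketch that tolerates a missing file; delete_if_exists is a hypothetical helper name, not something from the answers here:

```python
import os


def delete_if_exists(file_path):
    """Delete file_path; return True if a file was removed, False if it was already gone."""
    try:
        os.remove(file_path)  # os.unlink(file_path) would behave identically
        return True
    except FileNotFoundError:
        return False
```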

For browsing through files:

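import os
from os import walk

mypath = "./"
# next(walk(...)) gives (dirpath, dirnames, filenames) for the top-level folder only
_, _, filenames = next(walk(mypath))

Process each file:

for file in filenames:
    path = os.path.join(mypath, file)  # full path, so this also works for other folders
    foundWord = yourFunction(path)
    if foundWord:
        os.remove(path)  # Delete the file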

Write yourFunction() so that it returns True when the word is found in the file and False otherwise.
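As a rough sketch of what yourFunction() could look like: the version below assumes the legacy PyPDF2 reader methods already used in the question (PdfFileReader, getNumPages, getPage, extractText). The helper names pdf_page_texts and text_contains_phrase are made up for illustration. Whitespace differences are ignored when matching, because extracted text can contain a line break in the middle of a phrase; that choice is an assumption of this sketch, not something the question asks for.

```python
import re


def pdf_page_texts(path):
    """Yield the extracted text of each page (legacy PyPDF2 method names)."""
    import PyPDF2  # imported here so the pure-text helper below works without it

    reader = PyPDF2.PdfFileReader(path)
    for i in range(reader.getNumPages()):
        yield reader.getPage(i).extractText()


def text_contains_phrase(page_texts, phrase):
    """Return True as soon as any page contains phrase, ignoring whitespace differences."""
    # Escape each word so it is matched literally, and allow any run of
    # whitespace in the text where the phrase has a space.
    pattern = re.compile(r"\s+".join(re.escape(word) for word in phrase.split()))
    return any(pattern.search(text) for text in page_texts)


def yourFunction(path, phrase="Publications périodiques"):
    return text_contains_phrase(pdf_page_texts(path), phrase)
```

Keeping the text check separate from the PDF reading means the matching logic can be tried on plain strings before it is pointed at real files.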

I suppose your re.search() is already functional? Or is that part of your question?

If it works, you can use os to list all the files and filter them with a list comprehension to keep only the PDF files, like so:

import os

all_files = os.listdir("C:/../or_whatever_path")
# keep only names that end in .pdf (case-insensitive)
only_pdf_files = [file for file in all_files if file.lower().endswith(".pdf")]
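Note that os.listdir() returns bare file names rather than full paths, and that a substring test such as ".pdf" in name would also match a name like notes.pdf.bak. One possible variant checks the extension case-insensitively, skips directories, and returns full paths; list_pdf_paths is a hypothetical helper name:

```python
import os


def list_pdf_paths(folder):
    """Return sorted full paths of the regular files in folder whose extension is .pdf."""
    paths = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        # splitext("report.PDF") gives ("report", ".PDF"); lower() makes the check case-insensitive
        if os.path.isfile(full_path) and os.path.splitext(name)[1].lower() == ".pdf":
            paths.append(full_path)
    return sorted(paths)
```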

From that point on, you can iterate through the PDF files, run the code you've already written on each one, and delete the file with os.remove() as soon as ResSearch finds a match:

for file in only_pdf_files:
    object = PyPDF2.PdfFileReader(file)
    NumPages = object.getNumPages()  # page count for this file
    String = "Publications périodiques"
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        print("this is page " + str(i))
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        if ResSearch:
            os.remove(file)
            break  # the file is gone; stop scanning its remaining pages

EDIT:

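When your PDF files aren't in the same directory as your Python script, the folder path has to be prepended to the file name, both when opening the file and when calling os.remove(), because os.listdir() returns names without the folder:

folder = "C:/../or_whatever_path"  # the same folder that was passed to os.listdir()
for file in only_pdf_files:
    file_path = os.path.join(folder, file)
    object = PyPDF2.PdfFileReader(file_path)
    NumPages = object.getNumPages()
    String = "Publications périodiques"
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        if ResSearch:
            os.remove(file_path)
            break  # stop after the first match; the file no longer exists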
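Because a deletion cannot be undone, it may be worth separating the search from the removal. The following is only a sketch: delete_matching_pdfs and its contains_phrase and dry_run parameters are hypothetical names, and contains_phrase stands for whatever per-file check you use (for example, the page loop above wrapped in a function that returns True or False). With dry_run=True nothing is removed and the function only reports what would be deleted.

```python
import os


def delete_matching_pdfs(folder, contains_phrase, dry_run=True):
    """Return the paths of PDFs in folder for which contains_phrase(path) is true.

    The files are removed only when dry_run is False.
    """
    matches = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if name.lower().endswith(".pdf") and os.path.isfile(path):
            if contains_phrase(path):
                matches.append(path)
    # Delete only after the whole folder has been scanned, so an error
    # during the scan leaves every file in place.
    if not dry_run:
        for path in matches:
            os.remove(path)
    return matches
```

Calling it first with the default dry_run=True and checking the returned list, then running it again with dry_run=False, gives a chance to catch false positives before anything is lost.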

 