Hi, I am trying to delete the PDF files in a folder that contain the phrase "Publications périodiques" on the first page. So far I am able to search for the phrase, but I don't know how to delete the files.
Code used to search for the phrase in a PDF file:
import PyPDF2
import re

object = PyPDF2.PdfFileReader("202105192101394-60.pdf")
NumPages = object.getNumPages()  # number of pages; needed by the loop below
String = "Publications périodiques"
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i))
    Text = PageObj.extractText()
    # print(Text)
    ResSearch = re.search(String, Text)  # match object, or None if not found
    print(ResSearch)
Also, how can I loop this over multiple files?
You can delete any file using:
import os
os.remove("C:/fake/path/to/file.pdf")
To delete a file, use:
import os
os.unlink(file_path)
where file_path is the path to the relevant file. os.unlink() is equivalent to os.remove().
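os.remove() and os.unlink() both raise FileNotFoundError if the file is already gone, which matters when deleting inside a loop. A minimal sketch of a tolerant wrapper; the name delete_if_exists is hypothetical and not part of the os module:

```python
import os


def delete_if_exists(file_path):
    """Delete file_path and return True, or return False if it does not exist.

    Hypothetical helper for illustration; it only wraps os.remove().
    """
    try:
        os.remove(file_path)
    except FileNotFoundError:
        return False
    return True
```

The return value tells you whether anything was actually deleted, which can be handy for logging.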
To browse through the files:
import os
from os import walk

mypath = "./"
_, _, filenames = next(walk(mypath))  # names of the files directly inside mypath
Process each file:
for file in filenames:
    file_path = os.path.join(mypath, file)
    foundWord = yourFunction(file_path)
    if foundWord:
        os.remove(file_path)  # Delete the file
Write yourFunction() so that it returns True or False.
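One possible shape for yourFunction(), as a rough sketch. It assumes the legacy PyPDF2 1.x API used in the question (PdfFileReader, getPage, extractText); newer PyPDF2 releases renamed these to PdfReader, reader.pages and extract_text(). The names normalize, texts_contain_phrase and pdf_contains_phrase are made up for this example, and collapsing whitespace before comparing is my own assumption about how extracted text tends to look, not something PyPDF2 requires:

```python
def normalize(text):
    """Collapse runs of whitespace so line breaks inside the phrase don't matter."""
    return " ".join((text or "").split())


def texts_contain_phrase(page_texts, phrase):
    """Return True if the phrase occurs in any of the extracted page texts."""
    wanted = normalize(phrase)
    return any(wanted in normalize(text) for text in page_texts)


def pdf_contains_phrase(pdf_path, phrase="Publications périodiques"):
    """Return True/False depending on whether the PDF at pdf_path mentions phrase."""
    import PyPDF2  # imported here so the helpers above work without PyPDF2

    reader = PyPDF2.PdfFileReader(pdf_path)  # legacy API, as in the question
    page_texts = (
        reader.getPage(i).extractText() for i in range(reader.getNumPages())
    )
    return texts_contain_phrase(page_texts, phrase)
```

pdf_contains_phrase can then stand in for yourFunction in the loop above. Keeping the text comparison separate from the PDF reading also makes it easy to check the matching logic on plain strings.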
I assume your re.search() already works? Or is that part of your question?
If it works, you can use os to list all the files and filter them with a list comprehension to keep only the PDF files, like so:
import os
all_files = os.listdir("C:/../or_whatever_path")
only_pdf_files = [file for file in all_files if file.lower().endswith(".pdf")]
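Matching on the file extension rather than on a substring avoids picking up names like notes.pdf.txt, and it is worth skipping directories too. A small sketch of such a filter using pathlib; list_pdf_files is a hypothetical helper name, and folder is whatever path you are scanning:

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return sorted paths of regular files in folder whose extension is .pdf."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The returned Path objects already include the folder, so they can be passed straight to os.remove(); wrap them in str() for libraries that expect a plain string path.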
From that point on, you can iterate through the PDF files and run the code you've already written on each one. When ResSearch is truthy (re.search() returns a match object on success and None otherwise), delete the file with os.remove():
for file in only_pdf_files:
    object = PyPDF2.PdfFileReader(file)
    NumPages = object.getNumPages()
    String = "Publications périodiques"
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        print("this is page " + str(i))
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        if ResSearch:
            os.remove(file)
            break  # the file is gone; stop scanning its remaining pages
EDIT:
If your PDF files aren't in the same directory as your Python script, prepend the folder path to the file name before passing it to PyPDF2.PdfFileReader() and os.remove():
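folder = "C:/../or_whatever_path"  # same folder that was passed to os.listdir()
for file in only_pdf_files:
    file_path = os.path.join(folder, file)
    object = PyPDF2.PdfFileReader(file_path)
    NumPages = object.getNumPages()
    String = "Publications périodiques"
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        if ResSearch:
            os.remove(file_path)
            break  # stop after deleting; the file no longer exists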
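Since deletion cannot be undone, it may help to separate "which files match" from "delete them" and try a dry run first. The following is only a sketch under my own assumptions: delete_matching_files and its dry_run flag are hypothetical, and matches is any function that takes a file path and returns True or False (for example, a wrapper around the PyPDF2 search above):

```python
import os


def delete_matching_files(folder, matches, dry_run=True):
    """Delete the .pdf files in folder for which matches(path) is true.

    Returns the list of matching paths. With dry_run=True nothing is deleted,
    so the list can be inspected first.
    """
    selected = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not (os.path.isfile(path) and name.lower().endswith(".pdf")):
            continue  # skip directories and non-PDF files
        if matches(path):
            selected.append(path)
            if not dry_run:
                os.remove(path)
    return selected
```

Calling delete_matching_files(folder, matches) and printing the result shows what would be removed; passing dry_run=False then deletes those files.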