I'm currently using Python for word recognition within a directory. The directory is made up of hundreds of PDFs, and when I run this loop that parses the PDFs and searches for the words, it takes an absurd amount of time (2+ minutes). I am thinking that implementing multithreading could speed up the work somewhat, but I'm having a difficult time working out the best way to do this by splitting up the files in the directory. Here is roughly what the directory looks like:
But obviously there are a ton more folders and a ton more files. Here is the current code for parsing the PDFs:
import os
import re

import PyPDF2


def searchButtonClicked(self):
    name = self.lineEdit.text()
    self.listWidget.addItem("Searching with the name: " + name)
    # Walk the whole tree and look at every file in every sub-folder.
    for root, dirs, files in os.walk(self.directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            print(file_path)
            if file_path.endswith(".pdf"):
                pdf_object = PyPDF2.PdfFileReader(file_path, strict=False)
                num_of_pages = pdf_object.getNumPages()
                # Extract the text page by page and search each page for the name.
                for i in range(num_of_pages):
                    PageObj = pdf_object.getPage(i)
                    Text = PageObj.extractText()
                    ResSearch = re.search(name, Text)
                    if ResSearch:
                        self.listWidget.addItem(file_path + " Page" + str(i + 1))
    self.listWidget.addItem("---------------Done---------------")
What would be the most efficient way to parse this directory?
I'd try using a compiled regex instead of a str. For example, put this line before the loop:
name = re.compile(name)
That avoids having to compile the expression on every search.
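As a minimal sketch of that idea (the helper name `find_matching_pages` and the sample page strings are made up for illustration, they are not from the question): the pattern is compiled once and then reused for every page. Python's `re` module does cache recently compiled patterns, so the saving may turn out to be small next to the PDF parsing itself.

```python
import re


def find_matching_pages(pattern_text, page_texts):
    """Return the 1-based numbers of the pages whose text matches."""
    # Compile once, outside the loop, and reuse the compiled object.
    pattern = re.compile(pattern_text)
    return [number
            for number, text in enumerate(page_texts, start=1)
            if pattern.search(text)]


# Hypothetical page texts standing in for PageObj.extractText() results.
pages = ["Invoice for Alice", "Nothing here", "alice and Bob"]
matches = find_matching_pages("Alice", pages)
print(matches)
```

If the search box is meant to hold plain text rather than a regular expression, wrapping it in `re.escape(name)` before compiling keeps characters such as `.` or `(` from being read as regex syntax.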
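Another optimization could be to use multiprocessing and have the worker processes read the file names from a queue. Threading would probably be even slower: the parsing is most likely CPU-bound Python code, and the GIL lets only one thread execute Python bytecode at a time.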
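Here is a rough sketch of the multiprocessing idea, under a few assumptions. It uses `multiprocessing.Pool`, which hands the jobs to the workers through its own internal task queue, so you don't have to manage a queue by hand. The function names (`find_pdf_paths`, `matching_pages`, `search_pdf`, `search_directory`) are invented for this example, and the PyPDF2 calls are simply the ones from the question (newer PyPDF2 releases renamed them). I haven't run it against your directory, so treat it as a starting point.

```python
import os
import re
from multiprocessing import Pool


def find_pdf_paths(directory):
    """Walk the tree once and collect every path ending in .pdf."""
    paths = []
    for root, _dirs, files in os.walk(directory):
        for file_name in files:
            if file_name.endswith(".pdf"):
                paths.append(os.path.join(root, file_name))
    return sorted(paths)


def matching_pages(pattern, page_texts):
    """Return the 1-based numbers of the pages whose text matches."""
    return [number
            for number, text in enumerate(page_texts, start=1)
            if pattern.search(text)]


def search_pdf(job):
    """Worker: runs in a separate process and handles one PDF per call."""
    file_path, pattern_text = job
    # Imported here so the rest of the module works without PyPDF2;
    # the reader calls below are the same ones the question uses.
    import PyPDF2
    reader = PyPDF2.PdfFileReader(file_path, strict=False)
    texts = (reader.getPage(i).extractText()
             for i in range(reader.getNumPages()))
    return file_path, matching_pages(re.compile(pattern_text), texts)


def search_directory(directory, pattern_text, processes=None):
    """Spread the PDFs over a pool of worker processes."""
    jobs = [(path, pattern_text) for path in find_pdf_paths(directory)]
    # processes=None lets Pool pick one worker per CPU core.
    with Pool(processes) as pool:
        # imap_unordered yields each result as soon as its file is done.
        return [(path, page_numbers)
                for path, page_numbers in pool.imap_unordered(search_pdf, jobs)
                if page_numbers]
```

In the button handler you would call `search_directory(self.directory, name)` and add the returned paths and page numbers to the list widget from the main process, since the workers cannot touch the GUI. On Windows and macOS the program's entry point needs to sit under an `if __name__ == "__main__":` guard for the pool to start properly. A PDF that fails to parse will raise inside the worker and the exception will surface in the caller, so you may want a `try`/`except` in `search_pdf`.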
You could split the code into functions and then profile the script to see which part is the slowest with:
python -m cProfile script.py
which lists all functions with their call counts and times.
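The same profiler can also be driven from inside the program, which is handy when the slow code is triggered by a button click. The sketch below is only an illustration: `extract_pages`, `search_pages` and `run_search` are stand-ins with made-up data, not the real PDF code. It shows why splitting the work into functions helps, because each one gets its own row in the report.

```python
import cProfile
import io
import pstats
import re


def extract_pages():
    """Stand-in for the PDF text extraction step (made-up data)."""
    return ["page %d mentions Alice" % i for i in range(1000)]


def search_pages(pattern, pages):
    """Return the 1-based numbers of the pages that match."""
    return [number
            for number, text in enumerate(pages, start=1)
            if pattern.search(text)]


def run_search():
    return search_pages(re.compile("Alice"), extract_pages())


# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
hits = run_search()
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

On the command line, `python -m cProfile -s cumtime script.py` gives the same ordering. If nearly all of the time ends up in the text extraction rows, that tells you the regex is not the bottleneck and spreading the files over processes is where the gain would be.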