Fastest Way To Parse Hundreds Of Pdfs In Directory

Question

I'm currently using Python to search for words in the PDFs of a directory. The directory contains hundreds of PDFs, and the loop that parses them and searches for the words takes an absurd amount of time (more than 2 minutes). I think multithreading could speed this up, but I'm having a hard time working out the best way to split up the files in the directory. Here is roughly what the directory looks like:

Obviously there are many more folders and many more files. Here is the current code for parsing the PDFs:

import os
import re

import PyPDF2

def searchButtonClicked(self):
    # Excerpt: a method of the GUI class that owns lineEdit and listWidget.
    name = self.lineEdit.text()
    self.listWidget.addItem("Searching with the name: " + name)
    for root, dirs, files in os.walk(self.directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            print(file_path)
            if file_path.endswith(".pdf"):
                pdf_object = PyPDF2.PdfFileReader(file_path, strict=False)
                num_of_pages = pdf_object.getNumPages()

                # Check each page; report hits with 1-based page numbers.
                for i in range(num_of_pages):
                    PageObj = pdf_object.getPage(i)
                    Text = PageObj.extractText()
                    ResSearch = re.search(name, Text)
                    if ResSearch:
                        self.listWidget.addItem(file_path + " Page" + str(i + 1))
    self.listWidget.addItem("---------------Done---------------")

What would be the most efficient way to parse this directory?

Answer 1

I'd try using a compiled regex instead of a str. For example, put this line before the loop:

name = re.compile(name)

That avoids recompiling the expression on every search.

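Another optimization could be to use multiprocessing and have the worker processes read the filenames from a queue. Threading would probably be even slower, because the parsing is CPU-bound and Python threads are limited by the global interpreter lock.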
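As a rough sketch of the multiprocessing idea, something like the following might work. It uses multiprocessing.Pool, which feeds the file paths to the workers through its own internal task queue, rather than an explicit Queue. The helper names (find_pdfs, matching_pages, search_pdf, search_directory) are made up for illustration, and the PyPDF2 calls are the same pre-3.0 ones used in the question.

```python
import os
import re
from multiprocessing import Pool


def find_pdfs(directory):
    """Yield the path of every .pdf file below directory."""
    for root, _dirs, files in os.walk(directory):
        for file_name in files:
            if file_name.lower().endswith(".pdf"):
                yield os.path.join(root, file_name)


def matching_pages(page_texts, pattern):
    """Return the 1-based numbers of the pages whose text matches pattern."""
    regex = re.compile(pattern)  # compiled once per PDF
    return [n for n, text in enumerate(page_texts, start=1) if regex.search(text)]


def search_pdf(job):
    """Worker function: runs in a separate process, one PDF per call."""
    file_path, pattern = job
    import PyPDF2  # imported lazily; assumes the pre-3.0 API from the question
    reader = PyPDF2.PdfFileReader(file_path, strict=False)
    texts = (reader.getPage(i).extractText() for i in range(reader.getNumPages()))
    return file_path, matching_pages(texts, pattern)


def search_directory(directory, pattern, workers=None):
    """Search all PDFs in parallel; return [(file_path, [page numbers]), ...]."""
    jobs = [(path, pattern) for path in find_pdfs(directory)]
    with Pool(processes=workers) as pool:  # workers=None uses one process per CPU
        results = pool.imap_unordered(search_pdf, jobs)
        return [(path, pages) for path, pages in results if pages]
```

The button handler would then call search_directory(self.directory, name) and add the returned hits to the list widget itself, since the GUI objects live only in the main process. Results arrive in completion order, not directory order. On platforms that start workers by spawning a fresh interpreter (Windows, for example), the script's entry point needs an if __name__ == "__main__": guard.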

You could split the code into functions and then profile the script to see which part is the slowest with:

python -m cProfile script.py

which lists all functions with their call counts and times.
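If profiling the whole script is too noisy, the same information can be collected for a single call from inside Python. This is only a sketch: profile_call is a hypothetical helper name, and it relies on the standard cProfile and pstats modules.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Sort by cumulative time so the most expensive call chains come first.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For example, once the PDF search has been split into its own function, wrapping that function in profile_call and printing the report should show whether the time goes to text extraction or to the regex search.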
