
Python MultiThreading Not Taking From Queue Properly

I am parsing a very large number of documents at once. To do this I load the list of documents into a Queue, then assign threads that parse the PDFs until the Queue is empty, at which point they should return my results. However, when I print the size of the Queue, it often outputs the same number even while multiple threads are running, which makes me think some threads are getting the same PDF. I believe my threading setup is inefficient and would appreciate any input to make this parse and run faster. Here is my current code:

    def searchButtonClicked(self):
        name = self.lineEdit.text()
        self.listWidget.addItem("Searching with the name: " + name)
        num_threads = 35
        try:
            with open("my_saved_queue.obj","rb") as queue_save_file:
                self.my_loaded_list = pickle.load(queue_save_file)  # a plain list, not a Queue
                self.my_loaded_queue = queue.Queue()
                for row in self.my_loaded_list:
                    self.my_loaded_queue.put(row)
                for i in range(num_threads):
                    worker = threading.Thread(target=self.searchName, args=(name,))
                    worker.daemon = True  # setDaemon() is deprecated
                    worker.start()
        except FileNotFoundError:
            # First run: the saved queue file doesn't exist yet,
            # so build it and retry.
            self.saveFile()
            self.searchButtonClicked()

    def saveFile(self):
        my_queue = list()
        for root, dirs, files in os.walk(self.directory):
            for file_name in files:
                file_path = os.path.join(root, file_name)
                if file_path.endswith(".pdf"):
                    my_queue.insert(0,[PyPDF2.PdfFileReader(file_path, strict=False), file_path])
        with open("my_saved_queue.obj","wb+") as queue_save_file:
            pickle.dump(my_queue, queue_save_file)

    def searchName(self, name):
        try:
            queue_obj = self.my_loaded_queue.get(False)
        except Empty:
            # The queue is drained. Don't call task_done() here --
            # it should only be called once per successful get().
            self.listWidget.addItem("---------------Done---------------")
        else:
            pdf_object = queue_obj[0]
            file_path = queue_obj[1]
            num_of_pages = pdf_object.getNumPages()

            for i in range(num_of_pages):
                page_obj = pdf_object.getPage(i)
                text = page_obj.extractText()
                if re.search(name, text):
                    print(file_path + " Page " + str(i + 1))
                    self.listWidget.addItem(file_path + " Page " + str(i + 1))

            self.my_loaded_queue.task_done()  # pair with the successful get()
            print(self.my_loaded_queue.qsize())
            if not self.my_loaded_queue.empty():
                self.searchName(name)

    def clearListWidget(self):
        self.listWidget.clear()

Essentially I parse the directory and store all of the PDFs in a list, which I then save back to the directory so I can access it when searching for a name; this saves time, since I don't have to re-parse all the PDFs. Here is the output when printing the qsize at the bottom of searchName(): (screenshot of output)

As you can see, sometimes it outputs the same size multiple times, suggesting multiple threads got the same item, even though each thread gets from the top of the queue at the start, which should shrink the size.

That is normal.

This could happen in this order:

Thread 1 gets an object from the queue.
Thread 2 gets an object from the queue.
Thread 2 prints the size.
Thread 1 prints the size.
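The interleaving above can be forced deterministically with `threading.Event` (a standalone sketch, not the poster's code): two threads each remove a *different* item, yet both record the same qsize.

```python
import queue
import threading

q = queue.Queue()
q.put("a.pdf")
q.put("b.pdf")

sizes = []                      # qsize values recorded by each worker
t1_got = threading.Event()      # thread 1 has taken its item
t2_done = threading.Event()     # thread 2 has recorded its size

def worker1():
    q.get()                     # removes "a.pdf"; qsize is now 1
    t1_got.set()
    t2_done.wait()              # let thread 2 get and record first
    sizes.append(q.qsize())     # records 0

def worker2():
    t1_got.wait()
    q.get()                     # removes "b.pdf"; qsize is now 0
    sizes.append(q.qsize())     # records 0
    t2_done.set()

threads = [threading.Thread(target=worker1), threading.Thread(target=worker2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sizes)  # [0, 0] -- two different items, yet the same size printed twice
```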

But that is not a problem; the program can still work perfectly well.
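A loop-based worker also avoids the recursion in searchName and pairs every successful get() with task_done(). In this sketch the PDF search is stood in by a simple substring match, and `worker`, `results`, and the document names are made up for illustration:

```python
import queue
import threading

def worker(q, name, results):
    # Loop until the queue is drained; treat Empty as the exit signal.
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return
        # ... search `item` for `name` here (stand-in for the PDF search) ...
        if name in item:
            results.append(item)
        q.task_done()           # exactly one task_done() per successful get()

q = queue.Queue()
for doc in ["alice.pdf", "bob.pdf", "alice_2.pdf"]:
    q.put(doc)

results = []
threads = [threading.Thread(target=worker, args=(q, "alice", results))
           for _ in range(4)]
for t in threads:
    t.start()
q.join()                        # blocks until every item is marked done
for t in threads:
    t.join()
print(sorted(results))          # ['alice.pdf', 'alice_2.pdf']
```

queue.Queue handles the locking around get(), and q.join() returns only after every item has been marked done, so the main thread knows all documents were processed before reading the results.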
