
Multithreading / Multiprocessing with a for-loop in Python3

I have a task that is both I/O bound and CPU bound at the same time.

Basically I am getting a list of queries from a user, searching each of them on Google (via the Custom Search API), storing each query's results in its own .txt file, and storing all results in a results.txt file.

I was thinking that parallelism might be an advantage here. My whole task is wrapped in an object that has two member fields (a list and a dictionary) which I need to use across all threads/processes.

However, when I use multiprocessing I get weird results (I assume it is because of my shared resources).

i.e.:

class MyObject(object):
    _my_list = []
    _my_dict = {}

_my_dict contains key:value pairs of "query_name": list().

_my_list is a list of queries to search on Google. It is safe to assume that it is not written to.

For each query, I search it on Google, grab the top results, and store them in _my_dict.

I want to do this in parallel. I thought threading might help, but it seems to slow the work down.

Here is how I attempted to do it (this is the method that does the entire job per query):

def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """

This is the method that is supposed to execute all jobs for all queries in parallel:

def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()

The above execution does not work; I get corrupted results...
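A minimal standalone sketch of the likely mechanism (illustrative names only, not the actual class): each Pool worker receives its own pickled copy of the object, so writes to self._my_dict inside the workers never reach the parent process; returning results from the mapped function and collecting them in the parent is one way around it.

from multiprocessing import Pool

class MyObject(object):
    def __init__(self, queries):
        self._my_list = queries
        self._my_dict = {}

    def _do_job(self, query):
        # runs in a worker process: this mutates the worker's own copy of
        # self._my_dict, which is never sent back to the parent process
        result = ["top result for " + query]  # stand-in for the real HTTP call
        self._my_dict[query] = result
        return query, result                  # returning the data is what reaches the parent

    def find_articles(self):
        with Pool(processes=4) as p:
            # collect the returned pairs in the parent instead of relying on shared state
            self._my_dict = dict(p.map(self._do_job, self._my_list))

if __name__ == "__main__":
    obj = MyObject(["query1", "query2", "query3"])
    obj.find_articles()
    print(obj._my_dict)  # populated correctly in the parent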

When I use multithreading, however, the results are fine but very slow:

def find_articles(self):

    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()

    for thread in thread_pool:
        thread.join()

    self._create_final_log()

Any help would be appreciated, thanks!

I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and starting a thread per query launches too many threads at once and becomes the bottleneck). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as much of the network resources as possible without hitting a bottleneck, which is why the number of threads actively making requests at any one time is capped.

In your case, cycling through the list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the right thread pool size to avoid a bottleneck, but overall I've found this approach to work well.

import threading

class MultiThread:

    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
            func : function
                Callback function to multi-thread
            list_data : list
                List of data items to hand to the callback, one per call
            thread_cap : int
                Maximum number of threads active in the pool at once
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data
        self._index_lock = threading.Lock()  # guards current_index

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool.append(thread)
            thread.start()

    def _wrapper(self):
        while not self.complete:
            # claim the next index under the lock so no two threads grab the same item
            with self._index_lock:
                if self.current_index < self.total_index:
                    self.current_index += 1
                    index = self.current_index
                else:
                    self.complete = True
                    return
            self.func(self.list_data[index])

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()

import requests
# import time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.Session()

def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)


# start_time = time.time()

_my_list = ["examplequery" + str(n) for n in range(100)]
mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()

# output queries to file

# print("Time: {:.2f}".format(time.time() - start_time))

You could also open the file and write whatever you need to it as you go, or output the data at the end. Obviously, my replica here isn't exactly what you need, but it's solid boilerplate with a lightweight class I made that will greatly reduce the time this takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
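For example, a rough sketch of the "output data at the end" option, reusing the _my_dict built above (the write_results helper and the file names are assumptions, loosely matching the per-query .txt files and the results.txt described in the question):

import os

def write_results(results_dict, out_dir=".", combined_name="results.txt"):
    # one .txt file per query, plus a combined results.txt
    combined_path = os.path.join(out_dir, combined_name)
    with open(combined_path, "w", encoding="utf-8") as combined:
        for query, result in results_dict.items():
            # per-query file, named after the query
            with open(os.path.join(out_dir, query + ".txt"), "w", encoding="utf-8") as f:
                f.write(result)
            combined.write(query + "\n" + result + "\n\n")

# after mt.wait_on_completion():
write_results(_my_dict)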

In my test here, it completed 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before hitting a bottleneck.
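If you would rather not maintain the pool class yourself, the same capped-thread-pool idea can also be expressed with the standard library's concurrent.futures.ThreadPoolExecutor; a minimal sketch reusing the example_callback_func and _my_list defined above:

from concurrent.futures import ThreadPoolExecutor

# same capped pool via the standard library; max_workers plays the role of thread_cap
with ThreadPoolExecutor(max_workers=30) as pool:
    # map is lazy; list() forces all queries to run and re-raises any worker errors
    list(pool.map(example_callback_func, _my_list))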
