Multithreading/multiprocessing with a for loop in Python 3

Question

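I have a task that is both I/O bound and CPU bound at the same time.

Basically, I get a list of queries from the user, search Google for each of them (via custom-search-api), store each query's results in a .txt file, and store all results together in a results.txt file.

I was thinking that parallelism might be an advantage here. My whole task is wrapped in an object with 2 member fields that I am supposed to use across all threads/processes (a list and a dictionary).

As a result, when I use multiprocessing I get weird results (I assume this is because of my shared resources).

For example: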

class MyObject(object):
    _my_list = []
    _my_dict = {}

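_my_dict contains key:value pairs of the form "query_name": list().

_my_list is the list of queries to search on Google. It is safe to assume that it is never written to.

For each query: I search for it on Google, grab the top results and store them in _my_dict.

I want to do this in parallel. I thought threads might work well, but they seem to slow the work down.

Here is how I tried to do it (this is the method that does the whole job for a single query):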

def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """

This is the method that is supposed to run the jobs for all queries in parallel:

def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()

The implementation above does not work; I get corrupted results.

When I use multithreading instead, the results are fine, but it is very slow:

def find_articles(self):

    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()

    for thread in thread_pool:
        thread.join()

    self._create_final_log()

Any help would be appreciated, thanks!

Answer 1

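I have run into this on similar projects in the past (multiprocessing does not work efficiently, a single thread is too slow, and starting one thread per query fires requests too quickly and hits a bottleneck). I found that an efficient way to handle a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to finish the task is to use as much of the network as possible without causing a bottleneck, which is why there is a cap on how many threads are actively making requests at any one time.

In your case, cycling through the list of queries with a thread pool and a callback function would be a quick and easy way to get through all the data. Obviously, a lot of factors affect this, such as network speed and finding the right pool size to avoid a bottleneck, but overall I have found that it works well.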

import threading

class MultiThread:

    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
            func : function
                Callback function to multi-thread
            list_data : list
                List of data items; func is called once per item
            thread_cap : int
                Number of threads available in the pool
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data
        # Guards current_index so two threads never claim the same item
        self.lock = threading.Lock()

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool.append(thread)
            thread.start()

    def _wrapper(self):
        while True:
            # Claim the next index under the lock, then run func outside it
            with self.lock:
                if self.current_index >= self.total_index:
                    self.complete = True
                    return
                self.current_index += 1
                index = self.current_index
            self.func(self.list_data[index])

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()

import requests  # third-party package
# import time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.Session()

def example_callback_func(query):
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)

# start_time = time.time()

_my_list = ["examplequery" + str(n) for n in range(100)]
mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()

# output queries to file

# print("Time: {:.2f}".format(time.time() - start_time))

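You can also open a file and write whatever you need to it as you go, or output all the data at the end. Obviously, my replica here is not exactly what you need, but it is solid boilerplate built around a lightweight class I wrote, and it should greatly reduce the time required. It uses a thread pool to call a callback function that takes a single argument (the query).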
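As a rough illustration of writing output as you go, here is a minimal sketch. It makes several assumptions that are not in the code above: fake_search is a hypothetical stand-in for the real HTTP request, run_queries is a made-up helper name, and the standard library's concurrent.futures.ThreadPoolExecutor is swapped in for the MultiThread class to provide the capped pool. Each query is written to its own .txt file inside the worker, and the combined results.txt is written once after all workers have finished.

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fake_search(query):
    # Hypothetical stand-in for the real HTTP request
    return "results for " + query

def run_queries(queries, out_dir, thread_cap=5):
    out_dir = Path(out_dir)
    results = {}
    lock = threading.Lock()  # protects the shared dict

    def job(query):
        text = fake_search(query)
        # Each query gets its own file, so no two threads write the same path
        (out_dir / (query + ".txt")).write_text(text, encoding="utf-8")
        with lock:
            results[query] = text

    with ThreadPoolExecutor(max_workers=thread_cap) as pool:
        # list() consumes the iterator so an exception in any job is re-raised here
        list(pool.map(job, queries))

    # Combined log, written once after all workers have finished
    with open(out_dir / "results.txt", "w", encoding="utf-8") as f:
        for query in queries:
            f.write(query + "\t" + results[query] + "\n")
    return results

with tempfile.TemporaryDirectory() as tmp:
    results = run_queries(["alpha", "beta", "gamma"], tmp)
    combined = (Path(tmp) / "results.txt").read_text(encoding="utf-8")
    files = sorted(p.name for p in Path(tmp).iterdir())
```

Writing the combined file from a single thread at the end avoids having several threads append to the same file at once; if you would rather append as results arrive, the same lock can guard those writes.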

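In my test, it got through the 100 queries in about 2 seconds. I could definitely play with the thread cap and bring the time down further before hitting the bottleneck.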
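One way to experiment with the thread cap is to time the same batch of queries at a few different pool sizes. The sketch below is only an illustration under stated assumptions: fake_request is a hypothetical function that simulates network latency with time.sleep instead of making a real request, time_with_cap is a made-up helper, and ThreadPoolExecutor from the standard library stands in for the pool. Real numbers will depend on your network and on the API's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(query):
    # Hypothetical stand-in: the sleep simulates network latency
    time.sleep(0.02)
    return query.upper()

def time_with_cap(queries, thread_cap):
    # Run every query through a pool of thread_cap threads and time the batch
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=thread_cap) as pool:
        results = list(pool.map(fake_request, queries))
    return time.perf_counter() - start, results

queries = ["q" + str(n) for n in range(20)]
timings = {}
for cap in (1, 5, 10):
    elapsed, results = time_with_cap(queries, cap)
    timings[cap] = elapsed
    print("thread_cap={:2d}  {:.3f}s".format(cap, elapsed))
```

With simulated latency the time keeps dropping as the cap grows; against a real service you would expect the improvement to flatten out, or reverse, once the network or the remote server becomes the bottleneck.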

Solution 1
0 2021-12-12 01:27:59