Multithreading/multiprocessing with a for loop in Python 3

Question

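I have a task that is both I/O-bound and CPU-bound.

Basically, I get a list of queries from the user, search Google for each of them (via the custom-search-api), store each query's results in a .txt file, and store all the results in a results.txt file.

I was thinking that parallelism might be an advantage here. My whole task is wrapped in an object that has two member fields (a list and a dictionary), which I need to use across all threads/processes.

So when I use multiprocessing I get weird results (I assume this is because of my shared resources).

I.e.: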

class MyObject(object):
    _my_list = []
    _my_dict = {}

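_my_dict holds key:value pairs of the form "query_name": list().

_my_list is the list of queries to search on Google. It is safe to assume that it is never written to.

For each query, I search for it on Google, grab the top results, and store them in _my_dict.

I want to do this in parallel. I thought threading might work well, but it seems to slow the work down.

Here is how I tried to do it (this is the method that does the whole job for a single query):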

def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """

This is the method that is supposed to run the jobs for all queries in parallel:

def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()

The implementation above does not work; I get corrupted results.

When I use multithreading instead, the results are fine, but it is very slow:

def find_articles(self):

    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()

    for thread in thread_pool:
        thread.join()

    self._create_final_log()

Any help would be appreciated, thanks!

Answer 1

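I have run into this on similar projects in the past (multiprocessing was not efficient, a single thread was too slow, and starting one thread per query was too fast and hit a bottleneck). I found that an efficient way to handle this kind of task is to create a thread pool with a limited number of threads. Logically, the fastest way to finish the task is to use as much of the network as possible without creating a bottleneck, which is why the number of threads actively making requests at any one time is capped.

In your case, looping over the list of queries using a thread pool with a callback function would be a quick and easy way to go through all the data. Obviously many factors affect this, such as network speed and finding the right pool size to avoid a bottleneck, but overall I have found that this works well.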

import threading

class MultiThread:

    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
            func : function
                Callback function to multi-thread
            list_data : list
                List of data to multi-thread index
            thread_cap : int
                Amount of threads available in the pool
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data
        # Guards current_index so two threads never claim the same item
        self.lock = threading.Lock()

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool.append(thread)
            thread.start()

    def _wrapper(self):
        while True:
            # Claim the next index under the lock, then run the callback outside it
            with self.lock:
                if self.current_index >= self.total_index:
                    self.complete = True
                    return
                self.current_index += 1
                index = self.current_index
            self.func(self.list_data[index])

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()

import requests #, time
_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.sessions.session()
def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url+query)
    _my_dict[query] = r.text # whatever parsed results
    print(r, query)

    

#start_time = time.time()

_my_list = ["examplequery"+str(n) for n in range(100)]
mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()


# output queries to file

#print("Time:{:.2f}".format(time.time()-start_time))

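You could also open the files and write out whatever you need as you go, or output the data at the end. Obviously my replica here is not exactly what you need, but it is a solid boilerplate with a lightweight function I made that will greatly reduce the time it takes. It uses a thread pool to call a callback to a default function that takes a single parameter (the query).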
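
As a rough sketch of that last step, the collected results could be written out once every thread has joined. The helper name `write_results`, the file layout, and the sample data are assumptions made for illustration; they are not taken from the question's code. It also assumes each query is safe to use as a file name, which real queries may not be.

```python
import os
import tempfile


def write_results(results, out_dir):
    """Write one <query>.txt per query plus a combined results.txt.

    `results` maps each query string to a list of result strings,
    mirroring the question's _my_dict layout (an assumption).
    """
    os.makedirs(out_dir, exist_ok=True)
    combined_path = os.path.join(out_dir, "results.txt")
    with open(combined_path, "w", encoding="utf-8") as combined:
        for query, hits in sorted(results.items()):
            # Per-query file: one result per line.
            query_path = os.path.join(out_dir, query + ".txt")
            with open(query_path, "w", encoding="utf-8") as f:
                f.write("\n".join(hits) + "\n")
            # Combined file: the query as a header, then its results indented.
            combined.write(query + "\n")
            for hit in hits:
                combined.write("    " + hit + "\n")
    return combined_path


# Hypothetical sample data standing in for real search results.
sample = {"python": ["docs.python.org"], "threads": ["a", "b"]}
out_dir = tempfile.mkdtemp()
combined_path = write_results(sample, out_dir)
```

Writing only after the threads finish keeps all file access on a single thread, so the output files need no locking.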

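In my test, it finished looping through 100 queries in about 2 seconds. I could definitely play with the thread cap and bring the time down further before hitting the bottleneck.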
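
One way to experiment with the thread cap is to time the same workload at several pool sizes. The sketch below is only an illustration: it swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` rather than the `MultiThread` class above, and `fake_search` is a hypothetical stand-in that sleeps instead of making a real request, so the timings it prints say nothing about real network behavior.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_search(query):
    """Hypothetical stand-in for a network request: sleep, then return a result."""
    time.sleep(0.05)  # simulated latency, not a real measurement
    return query, "result for " + query


def run_with_cap(queries, thread_cap):
    """Run fake_search over all queries with at most thread_cap worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=thread_cap) as pool:
        # pool.map yields (query, result) pairs in input order
        results = dict(pool.map(fake_search, queries))
    return results, time.perf_counter() - start


queries = ["examplequery" + str(n) for n in range(20)]
timings = {}
for cap in (1, 5, 20):
    results, elapsed = run_with_cap(queries, cap)
    timings[cap] = elapsed
    print("thread_cap={:2d}  time={:.2f}s".format(cap, elapsed))
```

With real requests, the elapsed time would be expected to stop improving (or get worse) past some cap, and that point is the bottleneck to look for.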
