Multithreading / Multiprocessing with a for-loop in Python3
I have a task which is I/O bound and CPU bound at the same time.
Basically I get a list of queries from a user, Google-search each of them (via custom-search-api), store each query's results in its own .txt file, and store all results together in a results.txt file.
I was thinking that parallelism might be an advantage here. My whole task is wrapped in an object which has 2 member fields which I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume it is because of my shared resources).
i.e.:
class MyObject(object):
    _my_list = []
    _my_dict = {}
_my_dict contains key:value pairs of "query_name":list().
_my_list is a list of queries to search in Google. It is safe to assume that it is not written into.
For each query: I search it on Google, grab the top results and store them in _my_dict.
I want to do this in parallel. I thought threading might be good, but it seems to slow the work down.
How I attempted to do it (this is the method which does the entire job per query):
def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """
This is the method which is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()
The above execution does not work; I get corrupted results...
When I use multithreading, however, the results are fine, but very slow:
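(A likely cause of the corruption: each worker process operates on its own copy of the object, so mutations to _my_dict never reach the parent process. One way to actually share the dict across processes is a multiprocessing.Manager proxy; the sketch below uses a stand-in fetch function, not the original code.)

```python
from multiprocessing import Manager, Pool

def fetch(args):
    # stand-in for the real per-query search; `fetch` and the
    # result format here are assumptions for illustration
    query, shared = args
    shared[query] = ["top result for " + query]

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()  # proxy dict visible to all worker processes
        queries = ["python", "threading"]
        with Pool(processes=2) as pool:
            pool.map(fetch, [(q, shared) for q in queries])
        # mutations made inside the workers are visible here
        print(dict(shared))
```

Note the Manager proxy adds IPC overhead per access, which is part of why a thread pool (below) tends to win for I/O-bound work like HTTP requests.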
def find_articles(self):
    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and starting a thread per query spawns too many threads and bottlenecks). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete this task is to use as many network resources as possible without a bottleneck, which is why the number of threads actively making requests at one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the correct thread pool size to avoid a bottleneck, but overall I've found this to work well.
import threading

class MultiThread:
    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
        func : function
            Callback function to multi-thread
        list_data : list
            List of data to multi-thread over
        thread_cap : int
            Maximum number of threads in the pool
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data
        self._index_lock = threading.Lock()  # guards current_index

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool.append(thread)
            thread.start()

    def _wrapper(self):
        while not self.complete:
            # claim the next index under the lock so two threads never
            # grab the same item or run past the end of the list
            with self._index_lock:
                if self.current_index < self.total_index:
                    self.current_index += 1
                    index = self.current_index
                else:
                    self.complete = True
                    return
            self.func(self.list_data[index])

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()
import requests  # , time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.Session()

def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)

# start_time = time.time()
_my_list = ["examplequery" + str(n) for n in range(100)]

mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()

# output queries to file
# print("Time:{:2f}".format(time.time() - start_time))
You could also open the file and write output as you go, or write all the data at the end. Obviously, my replica here isn't exactly what you need, but it's a solid boilerplate with a lightweight class I made that will greatly reduce the time it takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
In my test here, it completed 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before I find the bottleneck.
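If you do stream results to a file from multiple threads, guard the writes with a lock so concurrent appends don't interleave. A minimal sketch (the file name and line format here are assumptions):

```python
import threading

write_lock = threading.Lock()

def save_result(query, text, path="results.txt"):
    # serialize appends so two threads never interleave partial lines
    with write_lock:
        with open(path, "a", encoding="utf-8") as f:
            f.write(query + "\t" + text + "\n")
```

Calling save_result from the callback keeps the output file consistent regardless of which thread finishes first.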
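For what it's worth, the standard library's concurrent.futures gives you the same capped-pool pattern out of the box; a minimal sketch, where the search function is a stand-in for the real per-query work:

```python
from concurrent.futures import ThreadPoolExecutor

def search(query):
    # stand-in for the real per-query work (HTTP request + parsing)
    return "result for " + query

queries = ["q0", "q1", "q2"]
# max_workers caps how many requests are in flight at once
with ThreadPoolExecutor(max_workers=30) as pool:
    # map preserves input order, so results line up with queries
    results = dict(zip(queries, pool.map(search, queries)))
print(results)
```

This avoids hand-rolling the index bookkeeping, at the cost of a little less control over scheduling.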