Python: what is the best way to handle multiple threads?
Since my scraper is running so slowly (one page at a time), I'm trying to use threads to make it work faster. I have a function scrape(website) that takes in a website to scrape, so I can easily create each thread and call start() on each of them.
Now I want to implement a num_threads variable that is the number of threads I want to run at the same time. What is the best way to handle those multiple threads?
For example: suppose num_threads = 5. My goal is to start 5 threads and grab the first 5 websites in the list and scrape them; then, if thread #3 finishes, it will immediately grab the 6th website from the list to scrape, not wait until the other threads end.
Any recommendation for how to handle it? Thank you.
It depends.
If your code is spending most of its time waiting for network operations (likely, in a web scraping application), threading is appropriate. The best way to implement a thread pool is to use concurrent.futures (available since 3.2). Failing that, you can create a queue.Queue object and write each thread as an infinite loop that consumes work objects from the queue and processes them.
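The queue-based pattern can be sketched as follows. Here scrape is only a hypothetical stand-in for the real scraping function, and a None sentinel per worker signals shutdown; this is a minimal sketch, not a production pool:

```python
import queue
import threading

def scrape(url):
    # hypothetical stand-in for the real scraping work
    return len(url)

def worker(q, results):
    # Infinite loop: consume work items from the queue until a sentinel arrives
    while True:
        url = q.get()
        if url is None:          # sentinel: shut this worker down
            q.task_done()
            break
        results.append((url, scrape(url)))
        q.task_done()

def run_pool(urls, num_threads=5):
    q = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)
    for _ in threads:
        q.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

As soon as a worker finishes one URL it pulls the next one from the queue, which gives exactly the "thread #3 grabs the 6th website immediately" behavior the question asks for.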
If your code is spending most of its time processing data after you've downloaded it, threading is useless due to the GIL. concurrent.futures provides support for process concurrency, but again only works in 3.2+. For older Pythons, use multiprocessing. It provides a Pool type which simplifies the process of creating a process pool.
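A minimal sketch of that Pool type; parse here is a hypothetical stand-in for the CPU-heavy work done on downloaded data:

```python
from multiprocessing import Pool

def parse(page):
    # hypothetical stand-in for CPU-heavy processing of downloaded data
    return page.upper()

if __name__ == '__main__':
    # Pool(processes=5) starts 5 worker processes; map distributes the
    # items across them and returns results in the original order.
    with Pool(processes=5) as pool:
        print(pool.map(parse, ['abc', 'def']))
```

Because each item runs in a separate process, this sidesteps the GIL for CPU-bound work at the cost of pickling the arguments and results between processes.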
You should profile your code (using cProfile) to determine which of those two scenarios you are experiencing.
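A quick way to do that profiling; busy is just a hypothetical stand-in for the code you want to measure:

```python
import cProfile
import io
import pstats

def busy():
    # hypothetical stand-in for your scraping/processing code
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Sort by cumulative time: if most time is spent in socket/network calls,
# use threads; if it is spent in your own parsing code, use processes.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```

For a whole script, running `python -m cProfile -s cumulative yourscript.py` gives the same report without modifying the code.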
If you're using Python 3, have a look at concurrent.futures.ThreadPoolExecutor.
Example pulled from the docs (ThreadPoolExecutor Example):
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
If you're using Python 2, there is a backport available (ThreadPoolExecutor Example):
from concurrent import futures
import urllib2

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib2.urlopen(url, timeout=timeout).read()

with futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)
    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url,
                                                     future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))