
Python: what is the best way to handle multiple threads?

My scraper is running slowly (one page at a time), so I'm trying to use threads to make it faster. I have a function scrape(website) that takes a website to scrape, so I can easily create a thread per website and call start() on each of them.

Now I want to add a num_threads variable: the number of threads I want to run at the same time. What is the best way to manage those threads?

For example, suppose num_threads = 5. My goal is to start 5 threads and have them grab and scrape the first 5 websites in the list; then, as soon as thread #3 finishes, it should immediately grab the 6th website from the list, not wait until the other threads end.

Any recommendation on how to handle this? Thank you.

It depends.

If your code spends most of its time waiting for network operations (likely, in a web scraping application), threading is appropriate. The best way to implement a thread pool is to use concurrent.futures (in the standard library since Python 3.2). Failing that, you can create a queue.Queue object and write each thread as a loop that consumes work items from the queue and processes them.
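The queue-based fallback can be sketched as follows. This is a minimal illustration, not a full scraper: the worker just uppercases each "URL" as a stand-in for real scraping work, and a None sentinel per thread signals shutdown.

```python
import queue
import threading

def worker(q, results):
    # Each worker loops, pulling items off the shared queue until
    # it sees the None sentinel.
    while True:
        url = q.get()
        if url is None:          # sentinel: no more work
            q.task_done()
            break
        results.append(url.upper())  # stand-in for scrape(url)
        q.task_done()

def run_pool(urls, num_threads=5):
    q = queue.Queue()
    results = []                 # list.append is thread-safe in CPython
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)
    for _ in threads:
        q.put(None)              # one sentinel per worker
    q.join()                     # wait until every item is processed
    for t in threads:
        t.join()
    return results
```

As soon as a worker finishes one item, it pulls the next from the queue, which gives exactly the "thread #3 grabs the 6th website immediately" behaviour the question asks for.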

If your code spends most of its time processing the data after you've downloaded it, threading is useless due to the GIL. concurrent.futures also provides process-based concurrency, but again only on Python 3.2+. For older Pythons, use multiprocessing. It provides a Pool type which simplifies creating a process pool.

You should profile your code (using cProfile) to determine which of those two scenarios you are in.
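One way to do that in-process (fake_scrape here is a hypothetical stand-in whose time.sleep mimics network waiting):

```python
import cProfile
import io
import pstats
import time

def fake_scrape(url):
    time.sleep(0.01)     # network wait would dominate here
    return url * 2

profiler = cProfile.Profile()
profiler.enable()
for u in ['a', 'b', 'c']:
    fake_scrape(u)
profiler.disable()

# Print the top entries sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

If most of the cumulative time lands in sleep/socket calls, you are I/O-bound and threads will help; if it lands in your own processing functions, reach for processes instead.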

If you're using Python 3, have a look at concurrent.futures.ThreadPoolExecutor.

Example pulled from the docs' ThreadPoolExecutor Example:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
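Adapted to the question's setup, a sketch might look like this (scrape here is a trivial stand-in that just tags the site name; a real version would download and parse the page):

```python
import concurrent.futures

def scrape(website):
    # Stand-in for the question's scrape(website).
    return 'scraped:' + website

websites = ['site1', 'site2', 'site3', 'site4', 'site5', 'site6', 'site7']
num_threads = 5

# max_workers caps concurrency at num_threads; as soon as one worker
# finishes, the executor hands it the next pending website, which is
# exactly the behaviour described in the question.
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    results = list(executor.map(scrape, websites))
```

executor.map returns results in input order; use submit plus as_completed (as in the docs example above) if you want to handle results as soon as each one finishes.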

If you're using Python 2, there is a backport available on PyPI (the futures package):

ThreadPoolExecutor Example:

from concurrent import futures
import urllib2  # Python 2: urllib.request does not exist

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib2.urlopen(url, timeout=timeout).read()

with futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)

    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url,
                                                     future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))
