
Should I use Multiprocessing, Multithreading, or something else?

I am a beginner Python coder currently working on a project that I've split into multiple code "modules," for lack of a better term. My code connects to an API, receives data, performs calculations on that data, and then sends communications back to the API depending on the calculated values.

Here are the "modules" I've split my code up into:这是我将代码分成的“模块”:

  1. API Authenticator --> a loop that checks whether or not the current API access code has expired. If it has not expired, nothing happens; if it has, the program submits a request for a new access code and stores the new code in memory.

  2. Data stream --> an open websockets connection that receives a continuous stream of data from the API.

  3. Data processor --> takes data from the data stream, computes processed values, and stores them in memory.

  4. Executor --> if conditions are met based on the values derived in the data processor, uses the API key generated by the authenticator module and submits a request to the server to perform some action.

I'm interested in running all these modules at once. The trouble is that they are continuous operations that need to loop over and over again forever while passing variables between them. Is there an effective way to have all of my "modules" run at once and pass data between each other? I'm looking into multiprocessing and multithreading, but am interested to see if there's a best bet going forward.

Thanks so much.

If you have both network latency and extensive CPU-intensive calculations, you should consider creating both a thread pool and a processing pool. The processing pool is passed to the worker function, which is invoked using the thread pool, and is used to perform the CPU-intensive work. Granted, this is a bit more complicated than just using multiprocessing, but it allows you to overlap the network retrieval more cheaply.

The idea is that if you had, for example, 200 URLs to retrieve and 8 processors on your computer, you could create a process pool of size 200 to overlap the retrieval of the 200 URLs as efficiently as possible, but of course you would only be able to process at most 8 of the CPU-intensive portions of the submitted tasks in parallel. Creating 200 processes, however, is a relatively expensive operation, especially on platforms such as Windows that use spawn to create new processes (you neglected to tag your question with your platform). It would therefore be cheaper to create a thread pool of size 200 for the retrievals and hand the CPU-intensive work off to a smaller process pool.

The following is a benchmark simulating the above situation, using just a multiprocessing pool versus a thread pool combined with a multiprocessing pool. Unfortunately, on Windows (my platform) you cannot create multiprocessing pools of sizes greater than 60. This means that if you are running under Windows and you have more than 60 URLs to retrieve, you really have no choice but to use the two types of pools:

from multiprocessing.pool import ThreadPool, Pool
from functools import partial
import time

def cpu_intensive_calculation(data):
    """ takes approximately .25 seconds on my desktop """
    QUARTER_SECOND_ITERATIONS = 5_000_000
    total = 0  # renamed from sum to avoid shadowing the built-in
    for _ in range(QUARTER_SECOND_ITERATIONS):
        total += 1
    return total

def fetch_data(url):
    """ approximately .3 seconds of "I/O" """
    time.sleep(.3)
    return ""

def retrieve_url1(url):
    retrieved_data = fetch_data(url)
    result = cpu_intensive_calculation(retrieved_data)
    return result

def benchmark1():
    urls = ['x'] * 60
    t1 = time.time()
    pool = Pool(60)
    result = pool.map(retrieve_url1, urls)
    t2 = time.time()
    print('Multiprocessing only:', t2 - t1)

def retrieve_url2(process_pool, url):
    retrieved_data = fetch_data(url)
    result = process_pool.apply(cpu_intensive_calculation, args=(retrieved_data,))
    return result

def benchmark2():
    urls = ['x'] * 60
    t1 = time.time()
    thread_pool = ThreadPool(60)
    process_pool = Pool()
    result = thread_pool.map(partial(retrieve_url2, process_pool), urls)
    t2 = time.time()
    print('Thread pool and multiprocessing pool:', t2 - t1)

if __name__ == '__main__':
    benchmark1()
    benchmark2()

Prints:

Multiprocessing only: 5.233615875244141
Thread pool and multiprocessing pool: 4.272637844085693

Alternatively, use coroutines (the aiohttp package from the PyPI repository) for the network retrieval and loop.run_in_executor to execute the CPU-intensive work.
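For illustration, here is a minimal sketch of that combination, assuming the aiohttp package is installed; the URL and the number of tasks are placeholders:

import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp

def cpu_intensive_calculation(data):
    """ same stand-in for CPU-bound work as in the benchmarks """
    total = 0
    for _ in range(5_000_000):
        total += 1
    return total

async def fetch_and_process(session, executor, url):
    # the network I/O is overlapped by the event loop...
    async with session.get(url) as response:
        data = await response.text()
    # ...while the CPU-bound portion runs in a separate process
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, cpu_intensive_calculation, data)

async def main():
    urls = ['https://example.com'] * 10  # placeholder URLs
    with ProcessPoolExecutor() as executor:
        async with aiohttp.ClientSession() as session:
            results = await asyncio.gather(
                *(fetch_and_process(session, executor, url) for url in urls))
    print(results)

if __name__ == '__main__':
    asyncio.run(main())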

But if the calculations you perform are relatively trivial, or are done in Python modules such as numpy that release the Global Interpreter Lock, just use multithreading for everything.
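For example, if the heavy lifting is a numpy operation, a thread-only version is enough. A sketch, assuming numpy is installed (numpy releases the GIL inside large array operations, so the threads really do run in parallel):

from multiprocessing.pool import ThreadPool
import time

import numpy as np

def fetch_data(url):
    """ approximately .3 seconds of simulated network I/O """
    time.sleep(.3)
    return np.random.rand(1_000_000)

def retrieve_url(url):
    data = fetch_data(url)
    # numpy drops the GIL inside this call, so no process pool is needed
    return float(np.sum(np.sqrt(data)))

if __name__ == '__main__':
    with ThreadPool(60) as thread_pool:
        results = thread_pool.map(retrieve_url, ['x'] * 60)
    print(len(results), 'results')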

Update

The following adds a benchmark that uses asyncio coroutines instead of multithreading; coroutines can handle hundreds, if not thousands, of concurrent web retrievals. It uses the ProcessPoolExecutor class from the concurrent.futures module for running "blocking" operations, so I have switched to that class for multiprocessing as well:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial
import asyncio
import time

def cpu_intensive_calculation(data):
    """ takes approximately .25 seconds on my desktop """
    QUARTER_SECOND_ITERATIONS = 5_000_000
    total = 0  # renamed from sum to avoid shadowing the built-in
    for _ in range(QUARTER_SECOND_ITERATIONS):
        total += 1
    return total

def fetch_data(url):
    """ approximately .3 seconds of "I/O" """
    time.sleep(.3)
    return ""

def retrieve_url1(url):
    retrieved_data = fetch_data(url)
    result = cpu_intensive_calculation(retrieved_data)
    return result

def benchmark1():
    urls = ['x'] * 60
    t1 = time.time()
    with ProcessPoolExecutor(max_workers=60) as pool:
        result = list(pool.map(retrieve_url1, urls))
    t2 = time.time()
    print('Multiprocessing only:', t2 - t1)

def retrieve_url2(process_pool, url):
    retrieved_data = fetch_data(url)
    future = process_pool.submit(cpu_intensive_calculation, retrieved_data)
    return future.result()

def benchmark2():
    urls = ['x'] * 60
    t1 = time.time()
    with ThreadPoolExecutor(max_workers=60) as thread_pool:
        with ProcessPoolExecutor() as process_pool:
            result = list(thread_pool.map(partial(retrieve_url2, process_pool), urls))
    t2 = time.time()
    print('Thread pool and multiprocessing pool:', t2 - t1)
    return result

async def retrieve_url3(url):
    await asyncio.sleep(.3) # the I/O portion
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(executor, cpu_intensive_calculation, '') # cpu portion
    return result


async def benchmark3():
    global executor  # retrieve_url3 reads the executor through this module-level name
    urls = ['x'] * 60
    t1 = time.time()
    with ProcessPoolExecutor() as executor:
        result = await asyncio.gather(*(retrieve_url3(url) for url in urls))
    t2 = time.time()
    print('asyncio:', t2 - t1)

if __name__ == '__main__':
    benchmark1()
    benchmark2()
    asyncio.run(benchmark3())

Prints:

Multiprocessing only: 5.816989183425903
Thread pool and multiprocessing pool: 4.408833742141724
asyncio: 4.401129961013794
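Finally, to tie this back to the four "modules" in the question: the authenticator, the data stream, and the executor are almost entirely I/O-bound, so a common pattern is to run each module in its own thread and pass data between them with queue.Queue objects; only the data processor would ever need a process pool, and only if its calculations are heavy. Here is a minimal sketch of that structure, with every piece of module logic reduced to a hypothetical placeholder:

import queue
import threading
import time

raw_data = queue.Queue()       # data stream -> data processor
processed = queue.Queue()      # data processor -> executor
token_lock = threading.Lock()  # guards the shared access token
token = {'value': None}

def authenticator():
    while True:
        with token_lock:
            token['value'] = 'fresh-token'  # placeholder for a real refresh request
        time.sleep(60)

def data_stream():
    while True:
        raw_data.put('tick')  # placeholder for a websocket message
        time.sleep(1)

def data_processor():
    while True:
        item = raw_data.get()     # blocks until the stream produces data
        processed.put(len(item))  # placeholder calculation

def executor():
    while True:
        value = processed.get()
        if value > 0:             # placeholder trigger condition
            with token_lock:
                current_token = token['value']
            print('submitting action with', current_token, value)

if __name__ == '__main__':
    for target in (authenticator, data_stream, data_processor, executor):
        threading.Thread(target=target, daemon=True).start()
    time.sleep(5)  # in a real program you would run forever or join on shutdown

If the data processor's calculations ever become heavy enough to saturate a core, that one stage can submit its work to a process pool exactly as in the benchmarks above.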
