
python asyncio.gather vs asyncio.as_completed when IO task followed by CPU-bound task

I have a program workflow as follows: 1. IO-bound (web page fetch) -> 2. CPU-bound (processing information) -> 3. IO-bound (writing results to database).

I'm presently using aiohttp to fetch the web pages, and asyncio.as_completed to gather Step 1 tasks and pass them to Step 2 as they complete. My concern is that Step 2 may interfere with the completion of the Step 1 tasks by consuming CPU resources and blocking program flow.

I've tried using ProcessPoolExecutor to farm the Step 2 tasks out to other processes, but those tasks use non-pickleable data structures and functions. I've tried ThreadPoolExecutor, and while it worked (i.e. it didn't crash), my understanding is that using threads for CPU-bound tasks is counter-productive.

Because the workflow has an intermediate CPU-bound task, would it be more efficient to use asyncio.gather (instead of asyncio.as_completed) to complete all of the Step 1 tasks before moving on to Step 2?

Sample asyncio.as_completed code:

async with ClientSession() as session:
    tasks = {self.fetch(session, url) for url in self.urls}
    for task in asyncio.as_completed(tasks):
        raw_data = await asyncio.shield(task)  # Step 1: await the next finished fetch
        data = self.extract_data(*raw_data)    # Step 2: CPU-bound processing (blocks the loop)
        await self.store_data(data)            # Step 3: write results to the database

Sample asyncio.gather code:

async with ClientSession() as session:
    tasks = {self.fetch(session, url) for url in self.urls}
    results = await asyncio.gather(*tasks)  # Step 1: complete all fetches first
for result in results:
    data = self.extract_data(*result)   # Step 2: CPU-bound processing
    await self.store_data(data)         # Step 3: write results to the database

Preliminary tests with limited samples show as_completed to be slightly more efficient than gather: ~2.98s (as_completed) vs ~3.15s (gather). But is there an asyncio conceptual issue that would favor one solution over the other?

"I've tried ThreadPoolExecutor, [...] it is my understanding that doing so for CPU-bound tasks is counter-productive." “我尝试过 ThreadPoolExecutor,[...] 我的理解是,对 CPU 密集型任务这样做会适得其反。” - it is countrproductiv in a sense you won't have two such asks running Python code in parallel, using multiple CPU cores - but otherwise, it will work to free up your asyncio Loop to continue working , if only munching code for one task at a time. - 从某种意义上说,你不会有两个这样的问题,使用多个 CPU 内核并行运行 Python 代码 - 但除此之外,它会释放你的 asyncio 循环以继续工作,如果只是在一个任务中咀嚼代码一次。

If you can't pickle things over to a subprocess, running the CPU-bound tasks in a ThreadPoolExecutor is good enough.
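
A minimal sketch of that hand-off, assuming a scraper object with the fetch / extract_data / store_data methods from the question's snippets (the run function and scraper name are illustrative, not part of the original code):

import asyncio
from concurrent.futures import ThreadPoolExecutor

from aiohttp import ClientSession

async def run(scraper, urls):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        async with ClientSession() as session:
            tasks = {scraper.fetch(session, url) for url in urls}
            for task in asyncio.as_completed(tasks):
                raw_data = await task
                # Step 2 runs in a worker thread: the GIL still prevents
                # parallel Python execution, but the event loop stays free
                # to finish the remaining Step 1 fetches in the meantime.
                data = await loop.run_in_executor(
                    pool, scraper.extract_data, *raw_data)
                await scraper.store_data(data)

On Python 3.9+, await asyncio.to_thread(scraper.extract_data, *raw_data) is a shorthand for the same hand-off using the default executor.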

Otherwise, just sprinkle your CPU-bound code with some await asyncio.sleep(0) calls (inside the loops) and run it normally as coroutines: that is enough for a CPU-bound task not to lock up the asyncio loop.
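
A minimal sketch of that cooperative-yield pattern (the crunch helper is a stand-in for the real per-record work, not anything from the question):

import asyncio

def crunch(record):
    # Stand-in for the heavy, synchronous per-record work.
    return sum(i * i for i in range(100_000))

async def extract_data(raw_data):
    results = []
    for record in raw_data:
        results.append(crunch(record))
        # sleep(0) suspends for exactly one event-loop cycle, letting
        # pending fetch callbacks run between records.
        await asyncio.sleep(0)
    return results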
