Python (3.7+) multiprocessing: replace Pipe connection between master and workers with asyncio for IO concurrency
Suppose we have the following toy version of a master-worker pipeline for parallel data collection:
# pip install gym
import gym
import numpy as np
from multiprocessing import Process, Pipe

def worker(master_conn, worker_conn):
    master_conn.close()  # close the master's end of the pipe in the child
    env = gym.make('Pendulum-v0')
    env.reset()
    while True:
        cmd, data = worker_conn.recv()
        if cmd == 'close':
            worker_conn.close()
            break
        elif cmd == 'step':
            results = env.step(data)
            worker_conn.send(results)
class Master(object):
    def __init__(self):
        self.master_conns, self.worker_conns = zip(*[Pipe() for _ in range(10)])
        self.list_process = [Process(target=worker, args=[master_conn, worker_conn], daemon=True)
                             for master_conn, worker_conn in zip(self.master_conns, self.worker_conns)]
        [p.start() for p in self.list_process]
        [worker_conn.close() for worker_conn in self.worker_conns]  # close worker ends in the parent

    def go(self, actions):
        [master_conn.send(['step', action]) for master_conn, action in zip(self.master_conns, actions)]
        results = [master_conn.recv() for master_conn in self.master_conns]
        return results

    def close(self):
        [master_conn.send(['close', None]) for master_conn in self.master_conns]
        [p.join() for p in self.list_process]
master = Master()
l = []
T = 1000
for t in range(T):
    actions = np.random.rand(10, 1)
    results = master.go(actions)
    l.append(len(results))
sum(l)
Because of the Pipe connection between the master and each worker, at every time step we have to send a command to each worker through its Pipe, and the worker sends the results back. We need to do this over a long horizon, so the frequent communication can sometimes be a bit slow.

Therefore, I am wondering: if I understand its functionality correctly, could using the newer Python feature asyncio combined with Process to replace Pipe potentially give a speedup thanks to IO concurrency?
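Concretely, something like the following sketch is what I have in mind: keep the multiprocessing Pipes, but overlap the blocking recv() calls with asyncio by pushing them onto the default thread-pool executor (recv_async and step_async are just placeholder names I made up):

import asyncio

async def recv_async(loop, conn):
    # Connection.recv() blocks, so run it in the default executor
    return await loop.run_in_executor(None, conn.recv)

async def step_async(master_conns, actions):
    loop = asyncio.get_running_loop()
    for conn, action in zip(master_conns, actions):
        conn.send(['step', action])
    # wait on all workers concurrently instead of one after another
    return await asyncio.gather(*[recv_async(loop, conn) for conn in master_conns])

# results = asyncio.run(step_async(master.master_conns, actions))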
The multiprocessing module already has a solution for parallel task processing: multiprocessing.Pool
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # start 4 worker processes
        print(pool.map(f, range(10)))  # prints "[0, 1, 4, ..., 81]"
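If you wanted to keep one gym env per worker process with Pool, a rough sketch (my illustration, not tested against your exact setup) could use an initializer so each process creates its env once; _init_env and step_env are names I made up. One caveat: Pool does not pin a given item to a given worker, so this only suits workloads where any worker may process any action:

import gym
import numpy as np
from multiprocessing import Pool

_env = None  # per-process environment, created once by the initializer

def _init_env():
    global _env
    _env = gym.make('Pendulum-v0')
    _env.reset()

def step_env(action):
    # note: there is no guarantee which worker (and thus which env) gets
    # which action - fine for data collection, not for per-env rollouts
    return _env.step(action)

if __name__ == '__main__':
    with Pool(processes=10, initializer=_init_env) as pool:
        for t in range(1000):
            actions = np.random.rand(10, 1)
            results = pool.map(step_env, actions)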
You can achieve the same using multiprocessing.Queue. I believe that's how pool.map() is implemented internally.
So, what's the difference between multiprocessing.Queue and multiprocessing.Pipe? Queue is just a Pipe plus some locking mechanism. Therefore multiple worker processes can share just a single Queue (or rather two - one for commands, one for results), but with Pipe each process needs its own Pipe (or a pair of them, or a duplex one), which is exactly what you are doing now.
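Here is a minimal sketch of that shared-queue layout applied to your toy pipeline (again my illustration, with the same caveat as above: commands are no longer pinned to a particular env, and results come back in arbitrary order):

import gym
from multiprocessing import Process, Queue

def worker(cmd_queue, result_queue):
    # every worker owns its own env but pulls from the same command queue
    env = gym.make('Pendulum-v0')
    env.reset()
    while True:
        cmd, data = cmd_queue.get()
        if cmd == 'close':
            break
        elif cmd == 'step':
            result_queue.put(env.step(data))

if __name__ == '__main__':
    cmd_queue, result_queue = Queue(), Queue()
    procs = [Process(target=worker, args=[cmd_queue, result_queue], daemon=True)
             for _ in range(10)]
    [p.start() for p in procs]
    # one 'step' command per worker; workers race for commands
    for i in range(10):
        cmd_queue.put(['step', [0.0]])
    results = [result_queue.get() for _ in range(10)]
    [cmd_queue.put(['close', None]) for _ in range(10)]
    [p.join() for p in procs]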
The only disadvantage of Queue is performance - because all processes share one queue mutex, it doesn't scale well to many processes. To be sure it can handle tens of thousands of items per second I would choose Pipe, but for the classic parallel task processing use case I think Queue or just Pool.map() should be OK because they are much easier to use. (Managing processes can be tricky, and asyncio doesn't make it easier either.)
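If you want to sanity-check that throughput difference on your machine, a quick single-process micro-benchmark could look like the sketch below. Note it only measures per-item serialization and transfer overhead, not the multi-process lock contention that hurts Queue at scale:

import time
from multiprocessing import Pipe, Queue

N = 100_000

# Pipe: send/recv within one process to measure raw transfer cost
a, b = Pipe()
start = time.perf_counter()
for i in range(N):
    a.send(i)
    b.recv()
print('Pipe:  %.0f items/s' % (N / (time.perf_counter() - start)))

# Queue: put/get goes through a feeder thread plus the shared lock
q = Queue()
start = time.perf_counter()
for i in range(N):
    q.put(i)
    q.get()
print('Queue: %.0f items/s' % (N / (time.perf_counter() - start)))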
Hope that helps; I'm aware I've answered a slightly different question than the one you asked :)