
Python multiprocessing: how to limit the number of waiting processes?

When running a large number of tasks (with large parameters) using Pool.apply_async, the processes are allocated and go into a waiting state, and there is no limit on the number of waiting processes. This can end up eating all memory, as in the example below:

import multiprocessing
import numpy as np

def f(a,b):
    return np.linalg.solve(a,b)

def test():

    p = multiprocessing.Pool()
    for _ in range(1000):
        p.apply_async(f, (np.random.rand(1000,1000),np.random.rand(1000)))
    p.close()
    p.join()

if __name__ == '__main__':
    test()

I'm looking for a way to limit the waiting queue, in such a way that there is only a limited number of waiting processes, and Pool.apply_async blocks while the waiting queue is full.

multiprocessing.Pool has a _taskqueue member of type multiprocessing.Queue, which takes an optional maxsize parameter; unfortunately it is constructed without maxsize set.

I'd recommend subclassing multiprocessing.Pool with a copy-paste of multiprocessing.Pool.__init__ that passes maxsize to the _taskqueue constructor.

Monkey-patching the object (either the pool or the queue) would also work, but you'd have to monkey-patch pool._taskqueue._maxsize and pool._taskqueue._sem, so it would be quite brittle:

# _maxsize and _sem are private attributes, so this can break between Python versions
pool._taskqueue._maxsize = maxsize
pool._taskqueue._sem = BoundedSemaphore(maxsize)
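
If reaching into Pool internals feels too fragile, another way to get the same blocking behaviour is to gate apply_async with a threading.BoundedSemaphore that is released from the task's callback. This is not taken from the answers here, just a minimal Python 3 sketch; the max_waiting parameter and the release helper are made-up names:

import multiprocessing
import threading

import numpy as np


def f(a, b):
    return np.linalg.solve(a, b)


def test(max_waiting=100):
    # At most `max_waiting` tasks may be submitted but not yet finished.
    sem = threading.BoundedSemaphore(max_waiting)

    def release(_):
        # Runs in the pool's result-handler thread when a task finishes or fails.
        sem.release()

    p = multiprocessing.Pool()
    for _ in range(1000):
        sem.acquire()  # blocks the submitter while too many tasks are outstanding
        p.apply_async(f, (np.random.rand(1000, 1000), np.random.rand(1000)),
                      callback=release, error_callback=release)
    p.close()
    p.join()


if __name__ == '__main__':
    test()

This bounds the number of submitted-but-unfinished tasks (waiting plus running), which is usually what matters for memory.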

Wait while pool._taskqueue is over the desired size:

import multiprocessing
import time

import numpy as np


def f(a,b):
    return np.linalg.solve(a,b)

def test(max_apply_size=100):
    p = multiprocessing.Pool()
    for _ in range(1000):
        p.apply_async(f, (np.random.rand(1000,1000),np.random.rand(1000)))

        # Throttle submission: block while too many tasks are waiting in the queue.
        while p._taskqueue.qsize() > max_apply_size:
            time.sleep(1)

    p.close()
    p.join()

if __name__ == '__main__':
    test()

Here is a monkey-patching alternative to the top answer:

import queue
from multiprocessing.pool import ThreadPool as Pool


class PatchedQueue():
  """
  Wrap stdlib queue and return a Queue(maxsize=...)
  when queue.SimpleQueue is accessed
  """

  def __init__(self, simple_queue_max_size=5000):
    self.simple_max = simple_queue_max_size  

  def __getattr__(self, attr):
    if attr == "SimpleQueue":
      return lambda: queue.Queue(maxsize=self.simple_max)
    return getattr(queue, attr)


class BoundedPool(Pool):
  # Override queue in this scope to use the patcher above
  queue = PatchedQueue()

pool = BoundedPool()
pool.apply_async(print, ("something",))

This works as expected on Python 3.8, where the multiprocessing Pool uses queue.SimpleQueue to set up the task queue. It sounds like the implementation of multiprocessing.Pool may have changed since 2.7.

You could add an explicit Queue with a maxsize parameter and use queue.put() instead of pool.apply_async() in this case. Then worker processes could:

for a, b in iter(queue.get, sentinel):
    # process it
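
For completeness, that idea might be fleshed out roughly as follows; this is only a sketch and not part of the original answer (the worker/main names, the SENTINEL value, and the choice of four workers are invented for illustration). A bounded multiprocessing.Queue makes put() block once max_waiting argument pairs are pending, which caps memory use:

#!/usr/bin/env python
import multiprocessing
import numpy as np

SENTINEL = None

def worker(task_queue, result_queue):
    # Pull (a, b) pairs until the sentinel arrives, then exit.
    for a, b in iter(task_queue.get, SENTINEL):
        result_queue.put(np.linalg.solve(a, b))

def main(n_workers=4, max_waiting=100):
    # put() blocks once max_waiting items are queued, so at most roughly
    # max_waiting argument pairs sit in memory at any time.
    task_queue = multiprocessing.Queue(maxsize=max_waiting)
    result_queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker,
                                       args=(task_queue, result_queue))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for _ in range(1000):
        task_queue.put((np.random.rand(1000, 1000), np.random.rand(1000)))
    for _ in workers:
        task_queue.put(SENTINEL)   # one sentinel per worker
    for _ in range(1000):
        result_queue.get()         # drain the (small) results
    for w in workers:
        w.join()

if __name__ == '__main__':
    main()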

If you want to limit the number of created input arguments/results that are in memory to approximately the number of active worker processes, then you could use the pool.imap*() methods:

#!/usr/bin/env python
import multiprocessing
import numpy as np

def f(a_b):
    return np.linalg.solve(*a_b)

def main():
    args = ((np.random.rand(1000,1000), np.random.rand(1000))
            for _ in range(1000))
    p = multiprocessing.Pool()
    for result in p.imap_unordered(f, args, chunksize=1):
        pass
    p.close()
    p.join()

if __name__ == '__main__':
    main()
