
Python Multiprocessing.Pool lazy iteration

I'm wondering about the way that Python's Multiprocessing.Pool class works with map, imap, and map_async. My particular problem is that I want to map over an iterator that creates memory-heavy objects, and I don't want all of these objects to be generated in memory at the same time. I wanted to see whether the various map() functions would wring my iterator dry, or intelligently call next() only as the child processes slowly advanced, so I hacked up some tests:

import time
from multiprocessing import Pool

def g():
  for el in xrange(100):
    print el
    yield el

def f(x):
  time.sleep(1)
  return x*x

if __name__ == '__main__':
  pool = Pool(processes=4)              # start 4 worker processes
  go = g()
  g2 = pool.imap(f, go)
  g2.next()

And so on with map, imap, and map_async. This is the most flagrant example, however: simply calling next() a single time on g2 prints out all the elements of my generator g(), whereas if imap were doing this 'lazily' I would expect it to call go.next() only once, and therefore print out only '0'.

Can someone clear up what is happening, and whether there is some way to have the process pool 'lazily' evaluate the iterator as needed?

Thanks,

Gabe

Let's look at the end of the program first.

The multiprocessing module uses atexit to call multiprocessing.util._exit_function when your program ends.

If you remove g2.next(), your program ends quickly.

The _exit_function eventually calls Pool._terminate_pool. The main thread changes the state of pool._task_handler._state from RUN to TERMINATE. Meanwhile the pool._task_handler thread is looping in Pool._handle_tasks and bails out when it reaches the condition

            if thread._state:
                debug('task handler found thread._state != RUN')
                break

(See /usr/lib/python2.6/multiprocessing/pool.py.)

This is what stops the task handler from fully consuming your generator, g(). If you look in Pool._handle_tasks you'll see

        for i, task in enumerate(taskseq):
            ...
            try:
                put(task)
            except IOError:
                debug('could not put task on queue')
                break

This is the code which consumes your generator. (taskseq is not exactly your generator, but as taskseq is consumed, so is your generator.)

In contrast, when you call g2.next(), the main thread calls IMapIterator.next and waits when it reaches self._cond.wait(timeout).

The fact that the main thread is waiting instead of calling _exit_function is what allows the task handler thread to run normally, which means fully consuming the generator as it puts tasks into the workers' inqueue in the Pool._handle_tasks function.

The bottom line is that all of the Pool map functions consume the entire iterable they are given. If you'd like to consume the generator in chunks, you could do this instead:

import multiprocessing as mp
import itertools
import time


def g():
    for el in xrange(50):
        print el
        yield el


def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=4)              # start 4 worker processes
    go = g()
    result = []
    N = 11                # number of items pulled from the generator per pool.map call
    while True:
        # islice takes at most N items; once the generator is exhausted,
        # pool.map returns an empty list and the loop ends.
        g2 = pool.map(f, itertools.islice(go, N))
        if g2:
            result.extend(g2)
            time.sleep(1)
        else:
            break
    print(result)
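
If you prefer to get the results back lazily instead of collecting them all into result, the same idea can be wrapped in a generator. A minimal sketch (the helper name map_in_chunks is mine, reusing the imports from the block above):

def map_in_chunks(pool, func, iterable, chunk=11):
    # Pull at most `chunk` items from the iterable at a time, map them
    # on the pool, and yield the results. Only one chunk of input
    # objects is materialized at any moment.
    it = iter(iterable)
    while True:
        chunk_results = pool.map(func, itertools.islice(it, chunk))
        if not chunk_results:      # iterable exhausted
            break
        for r in chunk_results:
            yield r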

I had this problem too and was disappointed to learn that map consumes all its elements. I coded a function which consumes the iterator lazily, using the Queue data type from multiprocessing. This is similar to what @unutbu describes in a comment on his answer but, as he points out, it suffers from having no callback mechanism for re-loading the Queue. The Queue datatype instead exposes a timeout parameter, and I've used 100 milliseconds to good effect.

from multiprocessing import Process, Queue, cpu_count
from Queue import Full as QueueFull
from Queue import Empty as QueueEmpty

def worker(recvq, sendq):
    for func, args in iter(recvq.get, None):
        result = func(*args)
        sendq.put(result)

def pool_imap_unordered(function, iterable, procs=cpu_count()):
    # Create queues for sending/receiving items from iterable.

    sendq = Queue(procs)
    recvq = Queue()

    # Start worker processes.

    for rpt in xrange(procs):
        Process(target=worker, args=(sendq, recvq)).start()

    # Iterate iterable and communicate with worker processes.

    send_len = 0
    recv_len = 0
    itr = iter(iterable)

    try:
        value = itr.next()
        while True:
            try:
                # Wrap the value in a tuple so the worker's func(*args)
                # call works for single-argument functions.
                sendq.put((function, (value,)), True, 0.1)
                send_len += 1
                value = itr.next()
            except QueueFull:
                while True:
                    try:
                        result = recvq.get(False)
                        recv_len += 1
                        yield result
                    except QueueEmpty:
                        break
    except StopIteration:
        pass

    # Collect all remaining results.

    while recv_len < send_len:
        result = recvq.get()
        recv_len += 1
        yield result

    # Terminate worker processes.

    for rpt in xrange(procs):
        sendq.put(None)

This solution has the advantage of not batching requests to Pool.map. One individual worker cannot block the others from making progress. YMMV. Note that you may want to use a different object to signal termination for the workers; in the example, I've used None.

Tested on "Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32"在“Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32”上测试

What you want is implemented in the NuMap package. From the website:

NuMap is a parallel (thread- or process-based, local or remote), buffered, multi-task, itertools.imap or multiprocessing.Pool.imap function replacement. Like imap, it evaluates a function on elements of a sequence or iterable, and it does so lazily. Laziness can be adjusted via the "stride" and "buffer" arguments.

In this example (see the code below) there are 2 workers.

The pool works as expected: when a worker is free, it takes the next iteration.

This code is the same as the code in the question, except for one thing: each item is 64 KB in size.

64 KB is the default socket buffer size; with items this large, writing a task to the workers' queue blocks, so the task handler can only pull new items from the generator as the workers keep up.

import itertools
from multiprocessing import Pool
from time import sleep


def f( x ):
    print( "f()" )
    sleep( 3 )
    return x


def get_reader():
    for x in range( 10 ):
        print( "readed: ", x )
        value = " " * 1024 * 64 # 64k
        yield value


if __name__ == '__main__':

    p = Pool( processes=2 )

    data = p.imap( f, get_reader() )

    p.close()
    p.join()

I ran into this issue as well, and came to a different solution than the other answers here, so I figured I would share it.

import collections, multiprocessing

def map_prefetch(func, data, lookahead=128, workers=16, timeout=10):
    with multiprocessing.Pool(workers) as pool:
        q = collections.deque()
        for x in data:
            q.append(pool.apply_async(func, (x,)))
            if len(q) >= lookahead:
                yield q.popleft().get(timeout=timeout)
        while len(q):
            yield q.popleft().get(timeout=timeout)

# Example usage (myfunction and huge_data_iterator are placeholders):
for x in map_prefetch(myfunction, huge_data_iterator):
    pass  # do stuff with x

Basically it uses a queue to hold at most lookahead pending requests to the worker pool, enforcing a limit on buffered results. The work starts as soon as possible within that limit, so it can run in parallel. The results also come back in order.
