
Iteration over pool.imap_unordered

Consider this very simple code:

#!/usr/bin/python

from multiprocessing import Pool
import random

def f(x):
    return x*x

def sampleiter(n):
    num = 0
    while num < n:
        rand = random.random()
        yield rand
        num += 1

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    for item in pool.imap_unordered(f, sampleiter(100000000000000), 20):
        print item
    pool.close

While running this in the terminal, Python leaks memory. What could be wrong?

Output buffering isn't the problem (or at least, not the only one), because (a) the Python process itself grows in memory, and (b) it still happens if you redirect to /dev/null.

I think the issue is that when you print out the results, the pool is returning results much faster than they can be consumed, so lots and lots of results are sitting in memory. If you look at the source of the class that does this, intermediate results are stored in a collections.deque called _items; I'd wager that _items is getting huge.

I'm not entirely sure how to test this, though, because even though imap_unordered returns an instance of this class you still seem to only be able to get at the generator methods:

In [8]: r = pool.imap_unordered(f, sampleiter(1e8), 20)

In [9]: print dir(r)
['__class__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__',
 '__init__', '__iter__', '__name__', '__new__', '__reduce__', '__reduce_ex__', 
 '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 
 'close', 'gi_code', 'gi_frame', 'gi_running', 'next', 'send', 'throw']

Update: if you add a time.sleep(.01) to f(), memory usage stays completely constant. So, yeah, the problem is that you're producing results faster than you can use them.

(As an aside: you mean pool.close() at the end of your code sample; pool.close is just a reference to the function and doesn't actually call it.)

The only variable I see here that causes the memory leak is your print statement. When I replace print item with pass, memory stays low and constant. I am not sure exactly what is happening under the hood when you print, but it's obviously stacking something up and not freeing it. Also, when I lower your chunk size to 1, the memory increases much more slowly (obviously), but the run also takes longer. So the chunk size does multiply the memory usage.

Update

Found this as a specific reference to memory usage increasing due to the terminal's history buffer (not the Python process itself): Memory leak when running python in Mac OS Terminal
