Python Multiprocessing Runaway Memory

I have an expensive function to run on many independent objects, so the work is trivially parallel and I'm trying to use the multiprocessing module. However, the memory consumption seems to be on a runaway up-and-to-the-right trajectory. See the attached image below.

Essentially I have a list of paths to large binary objects. I have a class that I instantiate with this list. In this class's __iter__ method, I read the file from disk and yield it. The idea is that I iterate through this list of objects (which reads each file into memory) and perform some expensive operation. Below is some sample code to simulate this. I'm using np.random.rand(100,100) to simulate reading the large file into memory, and I'm just indexing the [0,0] element of the matrix in the simulated expensive function.

import numpy as np
from pathos.multiprocessing import ProcessingPool as Pool
from memory_profiler import profile

class MyClass:
    def __init__(self, my_list):
        self.name = 'foo'
        self.my_list = my_list

    def __iter__(self):
        for item in self.my_list:
            yield np.random.rand(100,100)

def expensive_function(foo):
    foo[0,0]

my_list = range(100000)
myclass = MyClass(my_list)

iter(myclass) # should not return anything

p = Pool(processes=4, maxtasksperchild=50)
p.map(expensive_function, iter(myclass), chunksize=100)

The issue can be seen in the plot. The memory consumption just seems to climb and climb. I would expect the total memory consumption to be ~4x the consumption of each individual child process, but that doesn't seem to be the case.

[plot: total memory usage climbing steadily over the run]

What's causing this runaway memory usage, and how do I fix it?

Each time a child begins to invoke expensive_function, it receives a new np.random.rand(100,100) array from MyClass.__iter__. These arrays persist in the main process, so of course the memory usage continues to grow: Pool.map consumes the whole iterator in the parent (it materializes it into a list in order to chunk the work), so every array is created in the parent before it is ever shipped to a worker. The child processes can't clean these arrays up, because they live in the parent process. Note how the peak is a little under 8 GiB, which is about how much data you should expect to generate: 100000 arrays of 100x100 entries at 8 bytes per entry is 100000 × 100 × 100 × 8 B = 8 × 10⁹ B ≈ 7.45 GiB.
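
A common way around this (a minimal sketch, not taken from the original post, and simply reusing the question's pathos Pool setup) is to keep the large objects out of the parent entirely: hand map only lightweight items such as the file paths, and do the expensive load inside the worker. As in the question, np.random.rand stands in for reading a file from disk.

import numpy as np
from pathos.multiprocessing import ProcessingPool as Pool

def expensive_function(path):
    # Load (here: simulate) the large object inside the worker process,
    # so the parent never holds more than the lightweight `path` values.
    foo = np.random.rand(100, 100)
    return foo[0, 0]

my_list = range(100000)  # stand-in for the list of file paths

p = Pool(processes=4, maxtasksperchild=50)
# Only the small path-like items are pickled and sent to the workers;
# each 100x100 array is created and freed inside a single child process.
p.map(expensive_function, my_list, chunksize=100)

With this arrangement the parent's memory stays roughly flat, since the large arrays only ever exist inside the workers.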
