Python C extension - memory leaks

I'm relatively new to Python and this is my first attempt at writing a C extension.

Background: In my Python 3.x project I need to load and parse large binary files (10-100 MB) to extract data for further processing. The binary content is organized in frames: headers followed by a variable amount of data. Because this was slow in pure Python, I decided to write a C extension to speed up the loading part.

The standalone C code outperforms Python by a factor of 20x to 500x, so I am pretty satisfied with it.

The problem: memory usage keeps growing when I invoke the function from my C extension multiple times within the same Python module.


my_c_ext.c

#include <Python.h>
#include <numpy/arrayobject.h>
#include "my_c_ext.h"

static unsigned short *X, *Y;

static PyObject* c_load(PyObject* self, PyObject* args)
{
    char *filename;
    if(!PyArg_ParseTuple(args, "s", &filename))
        return NULL;

    PyObject *PyX, *PyY;

    __load(filename); 

    npy_intp dims[1] = {n_events};

    PyX = PyArray_SimpleNewFromData(1, dims, NPY_UINT16, X);
    PyArray_ENABLEFLAGS((PyArrayObject*)PyX, NPY_ARRAY_OWNDATA);

    PyY = PyArray_SimpleNewFromData(1, dims, NPY_UINT16, Y);
    PyArray_ENABLEFLAGS((PyArrayObject*)PyY, NPY_ARRAY_OWNDATA);

    PyObject *xy = Py_BuildValue("NN", PyX, PyY);


    return xy;
}

...

//More Python C-extension boilerplate (methods, etc..)

...

void __load(char *) {

    // open file, extract frame header and compute new_size
    X = realloc(X, new_size * sizeof(*X));
    Y = realloc(Y, new_size * sizeof(*Y));

    X[i] = ...
    Y[i] = ...

    return;
}

test.py

import my_c_ext as ce

binary_files = ['file1.bin',...,'fileN.bin']

for f in binary_files:
    x,y = ce.c_load(f)
    del x,y

Here I am deleting the returned objects in the hope of lowering memory usage.

After reading several posts (e.g. this, this and this), I am still stuck.

I tried adding and removing the PyArray_ENABLEFLAGS call that sets the NPY_ARRAY_OWNDATA flag, without seeing any difference. It is not yet clear to me whether NPY_ARRAY_OWNDATA implies a free(X) in C. If I explicitly free the arrays in C, I run into a segfault when trying to load the second file in the for loop in test.py.

Any idea what I am doing wrong?

This looks like a memory management disaster. NPY_ARRAY_OWNDATA should cause NumPy to call free on the data when the array is destroyed (or at least PyArray_free, which isn't necessarily the same thing...).

However, once this is done you still have the global variables X and Y pointing to a now-invalid area of memory. You then call realloc on those invalid pointers. At this point you're well into undefined behaviour, so anything could happen.
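One possible way out of the dangling-pointer problem, sketched in plain C so it compiles without Python: as soon as a buffer has been handed to a new owner, the global pointer is reset to NULL, so the next realloc behaves like malloc and allocates a fresh block instead of touching freed memory. The names load_into_global and take_buffer are hypothetical, and the caller here merely stands in for the NumPy array that would own the data in c_load.

```c
#include <assert.h>
#include <stdlib.h>

/* Global buffer pointer, as in the question. */
static unsigned short *X = NULL;

/* Hypothetical stand-in for __load: (re)allocates X and fills n values.
   n must be > 0. Returns 0 on success, -1 on allocation failure. */
static int load_into_global(size_t n, unsigned short seed)
{
    assert(n > 0);
    unsigned short *tmp = realloc(X, n * sizeof(*X));
    if (tmp == NULL)
        return -1;              /* X is unchanged and still valid */
    X = tmp;
    for (size_t i = 0; i < n; i++)
        X[i] = (unsigned short)(seed + i);
    return 0;
}

/* Hand the buffer to a new owner and forget it globally.
   The new owner is now solely responsible for freeing it. */
static unsigned short *take_buffer(void)
{
    unsigned short *owned = X;
    X = NULL;                   /* realloc(NULL, size) acts like malloc(size) */
    return owned;
}
```

In the extension, the equivalent of take_buffer would presumably run right after the array is created with the owning flag set, so that the array is the only remaining route to the memory.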


If it's a global variable then the memory needs to be managed globally, not by NumPy. If the memory is managed by the NumPy array then you need to ensure that you keep no other way to access it except through that NumPy array. Anything else is going to cause you problems.
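The first option (memory managed globally) could look roughly like the sketch below: the module keeps a single scratch buffer for its whole lifetime and each caller receives an independent copy. In the extension the copy target would be an array that NumPy allocated itself, for example one created with PyArray_SimpleNew and filled with a memcpy into PyArray_DATA; here a plain malloc stands in for it. The names scratch, scratch_fill, copy_out and scratch_release are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Module-owned scratch buffer: only this file ever reallocs or frees it. */
static unsigned short *scratch = NULL;
static size_t scratch_len = 0;

/* Hypothetical stand-in for __load: resize the scratch buffer and fill it.
   n must be > 0. Returns 0 on success, -1 on allocation failure. */
static int scratch_fill(size_t n, unsigned short seed)
{
    assert(n > 0);
    unsigned short *tmp = realloc(scratch, n * sizeof(*scratch));
    if (tmp == NULL)
        return -1;              /* scratch is unchanged and still valid */
    scratch = tmp;
    scratch_len = n;
    for (size_t i = 0; i < n; i++)
        scratch[i] = (unsigned short)(seed + i);
    return 0;
}

/* Give the caller its own copy; the caller frees it, never the scratch. */
static unsigned short *copy_out(void)
{
    if (scratch == NULL)
        return NULL;
    unsigned short *copy = malloc(scratch_len * sizeof(*copy));
    if (copy != NULL)
        memcpy(copy, scratch, scratch_len * sizeof(*copy));
    return copy;
}

/* Release the scratch buffer once, e.g. at module teardown. */
static void scratch_release(void)
{
    free(scratch);
    scratch = NULL;
    scratch_len = 0;
}
```

The copy costs one extra pass over the data per file, but the scratch buffer is reused across calls and no pointer is ever shared between the module and the returned arrays.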
