
Python list of Objects taking up too much memory

I have the following code, which creates a million objects of a class foo:

list_bar = []
for i in range(1000000):
    bar = foo()
    list_bar.append(bar)

The bar object is only 96 bytes, as determined by getsizeof(). However, the append step takes up almost 8 GB of RAM. Once the code exits the loop, RAM usage drops to the expected amount (size of the list plus some overhead, ~103 MB). Only while the loop is running does RAM usage skyrocket. Why does this happen? Any workarounds? PS: Using a generator is not an option; it has to be a list.

EDIT: xrange doesn't help; I'm using Python 3. The memory usage stays high only during loop execution and drops after the loop is through. Could append have some non-obvious overhead?

Most probably this is due to some unintended cyclic references made by the foo() constructor; normally, Python objects release memory instantly when their reference count drops to zero, but objects caught in a cycle are only freed later, when the garbage collector gets a chance to run.
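For illustration, here is a minimal sketch of what such a cycle can look like; the Foo class below is hypothetical, since the question does not show the real constructor:

import gc

class Foo:
    def __init__(self):
        self.me = self     # self-reference: a reference cycle

bar = Foo()
del bar                    # refcount never reaches zero, so nothing is freed yet
print(gc.collect())        # the cycle collector reclaims it; prints the number of unreachable objects found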

You can try forcing a GC run after, say, every 10000 iterations to see if it keeps the memory usage constant.

import gc

n = 1000000
list_bar = [None] * n        # preallocate so the list never resizes
for i in range(n):
    list_bar[i] = foo()
    if i % 10000 == 0:
        gc.collect()         # periodically reclaim any cyclic garbage

If this relieves the memory pressure, then the memory usage is due to reference cycles.
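If cycles are confirmed, one common workaround is to hold back-references weakly so the cycle never forms; a sketch with hypothetical Parent/Child names:

import weakref

class Parent:
    def __init__(self):
        self.children = []

class Child:
    def __init__(self, parent):
        parent.children.append(self)
        self._parent = weakref.ref(parent)   # weak back-reference: no cycle

    @property
    def parent(self):
        return self._parent()                # None once the parent is gone

p = Parent()
c = Child(p)
del p
print(c.parent)   # None: the parent was freed immediately by refcounting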


Resizing a list has some overhead. If you know how many elements there will be, you can create the full list beforehand, e.g.:

list_bar = [foo() for _ in range(1000000)]

This way the size of the list is known up front and no resizing is needed; or create the list filled with None:

n = 1000000
list_bar = [None] * n
for i in range(n):
    list_bar[i] = foo()

append should be using realloc to grow the list, and the old memory ought to be released as soon as possible; all in all, the overhead of the allocations should not add up to 8 GB for a list that ends up around 100 MB. It is possible that the operating system is misreporting the memory used.
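You can observe CPython's over-allocation pattern directly; this is just a sketch of an implementation detail, and the exact growth sequence varies between versions:

import sys

lst = []
prev = sys.getsizeof(lst)
for _ in range(64):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != prev:                # a realloc just happened
        print(f"len={len(lst):>3}  size={size} bytes")
        prev = size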

How are you measuring the memory usage?
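One way to check the process's actual footprint from the standard library is shown below; note that resource is Unix-only, and ru_maxrss is reported in kilobytes on Linux but bytes on macOS, so verify the units on your platform:

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"peak RSS so far: {usage.ru_maxrss} (platform-dependent units)")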

I suspect your usage of a third-party module might be the cause. Perhaps the third-party module temporarily uses a lot of memory when initialised.

Besides, sys.getsizeof() is not an accurate indication of the memory used by an object.

For example:

from sys import getsizeof

class A(object):
    pass

class B(object):
    def __init__(self):
        self.big = 'a' * 1024*1024*1024    # approx. 1 GiB

>>> getsizeof(A)
976
>>> a = A()
>>> getsizeof(a)
64
>>> 
>>> getsizeof(B)
976
>>> b = B()
>>> getsizeof(b)
64
>>> getsizeof(b.big)
1073741873

After instantiating b = B(), top reports approx. 1 GiB of resident memory usage. Obviously this is not reflected by getsizeof(b), which returns only 64 bytes.
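A rough way to approximate an object's true footprint is to recurse through containers and instance dictionaries. The deep_getsizeof helper below is only a sketch: it ignores __slots__, C-level buffers, and memory shared with untracked objects:

import sys

def deep_getsizeof(obj, seen=None):
    # Approximate recursive size; counts each object at most once.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(x, seen) for x in obj)
    if hasattr(obj, '__dict__'):
        size += deep_getsizeof(obj.__dict__, seen)
    return size

print(deep_getsizeof(B()))   # now includes the ~1 GiB string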
