Reduce memory usage when slicing numpy arrays

I'm having trouble freeing up memory in Python. The situation is basically this: I have a large dataset split into 4 files. Each file contains a list of 5000 numpy arrays of shape (3072, 412). I'm trying to extract, say, the 10th through 20th columns of each array into a new list.

What I'd like to do is sequentially read each file, extract the data I need, and then free up the memory I'm using before moving on to the next one. However, deleting the object, setting it to None, or setting it to 0, followed by a call to gc.collect(), doesn't seem to work. Here's the snippet of code I'm working with:

import gc

import joblib
import psutil

# base_filename is defined earlier (not shown)
num_files = 4
start = 10
end = 20
fields = []
for j in range(num_files):
    print("Working on file ", j)
    source_filename = base_filename + str(j) + ".pkl"
    print("Memory before: ", psutil.virtual_memory())
    partial_db = joblib.load(source_filename)
    print("GC tracking for partial_db is ", gc.is_tracked(partial_db))
    print("Memory after loading partial_db:", psutil.virtual_memory())
    for x in partial_db:
        fields.append(x[:, start:end])
    print("Memory after appending to fields: ", psutil.virtual_memory())
    print("GC Counts before del: ", gc.get_count())
    partial_db = None
    print("GC Counts after del: ", gc.get_count())
    gc.collect()
    print("GC Counts after collection: ", gc.get_count())
    print("Memory after freeing partial_db: ", psutil.virtual_memory())

and here's the output after a couple of files:

Working on file  0
Memory before:  svmem(total=67509161984, available=66177449984,percent=2.0, used=846712832, free=33569669120, active=27423051776, inactive=5678043136, buffers=22843392, cached=33069936640, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC Counts before del:  (0, 7, 3)
GC Counts after del:  (0, 7, 3)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Working on file  1
Memory before:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
GC Counts before del:  (0, 4, 2)
GC Counts after del:  (0, 4, 2)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)

If I let it keep going, it eventually uses up all the memory and triggers a MemoryError exception.

Does anyone know what I can do to make sure the data used by partial_db gets freed?

The problem is this:

for x in partial_db:
    fields.append(x[:, start:end])

The reason slicing a numpy array (unlike slicing a normal Python list) takes virtually no time and wastes no space is that it doesn't make a copy; it just creates another view into the same array's memory. Normally, that's great. But here it means you're keeping the memory for x alive even after you release x itself, because you never release those slices. Every 10-column slice you append to fields pins its entire (3072, 412) parent array, so the whole file's worth of data stays resident no matter what you do to partial_db.
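You can see the view behaviour directly in a minimal sketch, independent of the question's data (the float32 dtype here is just an assumption): a slice's .base attribute points at the array whose buffer the view shares, while a copy owns its own buffer.

import numpy as np

x = np.zeros((3072, 412), dtype=np.float32)   # stand-in for one array from the file; dtype assumed
view = x[:, 10:20]
print(view.base is x)         # True: the slice is just a view into x's buffer
print(view.nbytes, x.nbytes)  # the view reports ~120 KB, but it keeps all ~5 MB of x alive
copy = x[:, 10:20].copy()
print(copy.base is None)      # True: the copy owns its own small buffer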

There are other ways around this, but the simplest is to just append copies of the slices:

for x in partial_db:
    fields.append(x[:, start:end].copy())
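A variation worth considering (not part of the original answer, just a sketch reusing the question's base_filename, num_files, start and end, and assuming every array in a file has the same shape) is to stack each file's copied slices into one contiguous array, so each file contributes a single block instead of 5000 small arrays:

import joblib
import numpy as np

fields = []
for j in range(num_files):
    partial_db = joblib.load(base_filename + str(j) + ".pkl")
    # np.stack copies the sliced columns into one new (5000, 3072, 10) array,
    # so nothing here keeps the original arrays in partial_db alive
    fields.append(np.stack([x[:, start:end] for x in partial_db]))
    partial_db = None

Either way, the key point is that whatever you append must own its own memory; once nothing references the original (3072, 412) arrays, their memory can be returned when partial_db is released.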
