Reducing memory usage when slicing numpy arrays

Question

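I can't get Python to release memory. The situation is basically this: I have a large dataset split across 4 files. Each file contains a list of 5000 numpy arrays of shape (3072, 412). I'm trying to extract columns 10 through 20 of each array into a new list.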

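What I want to do is read each file in turn, extract the data I need, and free the memory I'm using before moving on to the next file. However, deleting the object, setting it to None, setting it to 0, and then calling gc.collect() all seem to have no effect. Here is the code snippet I'm using: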

import gc

import joblib
import psutil

num_files=4
start=10
end=20
fields = []
for j in range(num_files):
    print("Working on file ", j)
    source_filename = base_filename + str(j) + ".pkl"
    print("Memory before: ", psutil.virtual_memory())
    partial_db = joblib.load(source_filename)
    print("GC tracking for partial_db is ",gc.is_tracked(partial_db))
    print("Memory after loading partial_db:",psutil.virtual_memory())
    for x in partial_db:
        fields.append(x[:,start:end])
    print("Memory after appending to fields: ",psutil.virtual_memory())
    print("GC Counts before del: ", gc.get_count())
    partial_db = None
    print("GC Counts after del: ", gc.get_count())
    gc.collect()
    print("GC Counts after collection: ", gc.get_count())
    print("Memory after freeing partial_db: ", psutil.virtual_memory())

Here is the output after a couple of files:

Working on file  0
Memory before:  svmem(total=67509161984, available=66177449984,percent=2.0, used=846712832, free=33569669120, active=27423051776, inactive=5678043136, buffers=22843392, cached=33069936640, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC Counts before del:  (0, 7, 3)
GC Counts after del:  (0, 7, 3)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
Working on file  1
Memory before:  svmem(total=67509161984, available=40785944576, percent=39.6, used=26238181376, free=8014237696, active=54070542336, inactive=4540620800, buffers=22892544, cached=33233850368, shared=15945728)
GC tracking for partial_db is  True
Memory after loading partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
Memory after appending to fields:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)
GC Counts before del:  (0, 4, 2)
GC Counts after del:  (0, 4, 2)
GC Counts after collection:  (0, 0, 0)
Memory after freeing partial_db:  svmem(total=67509161984, available=15378006016, percent=77.2, used=51626561536, free=265465856, active=62507155456, inactive=3761905664, buffers=10330112, cached=15606804480, shared=15945728)

If I let it keep running, it uses up all the memory and triggers a MemoryError exception.

Does anyone know what I can do to make sure the data used by partial_db is released?

Answer 1

The problem is this:

for x in partial_db:
    fields.append(x[:,start:end])

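The reason slicing a numpy array (unlike a normal Python list) takes practically no time and wastes no space is that it doesn't create a copy; it just creates another view onto the array's memory. Usually that's great. But here it means you're keeping the memory for x alive even after you release x itself, because you never release those slices.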
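To see the view behavior directly, here is a small sketch. The array is a zero-filled stand-in for one element of partial_db, and the float32 dtype is an assumption (the question doesn't state it):

```python
import numpy as np

# Stand-in for one array from partial_db; dtype float32 is an assumption.
x = np.zeros((3072, 412), dtype=np.float32)

view = x[:, 10:20]                 # basic slicing: no data is copied

print(view.base is x)              # the view keeps a reference to its parent
print(np.shares_memory(view, x))   # both point into the same buffer
print(view.nbytes)                 # bytes the 10 sliced columns "look like"
print(view.base.nbytes)            # bytes actually kept alive by the view
```

As long as `view` is reachable (for example, from the fields list), its `.base` reference keeps the whole parent buffer alive, so setting partial_db to None and calling gc.collect() has nothing to free.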

There are other ways around this, but the simplest is to append copies of the slices:

for x in partial_db:
    fields.append(x[:,start:end].copy())
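As a rough check of how much this saves, the sketch below compares the memory pinned by views with the memory held by copies. It uses 3 arrays instead of 5000 so it runs quickly, and again assumes float32 data; the names views and copies are only for illustration:

```python
import numpy as np

start, end = 10, 20

# Tiny stand-in for partial_db: 3 arrays instead of 5000; float32 is assumed.
partial_db = [np.ones((3072, 412), dtype=np.float32) for _ in range(3)]

views = [x[:, start:end] for x in partial_db]          # what the question does
copies = [x[:, start:end].copy() for x in partial_db]  # what the answer suggests

# Each view pins its whole parent array; each copy owns only its own columns.
pinned_by_views = sum(v.base.nbytes for v in views)
held_by_copies = sum(c.nbytes for c in copies)

print(pinned_by_views)                       # 3 * 3072 * 412 * 4 bytes
print(held_by_copies)                        # 3 * 3072 * 10 * 4 bytes
print(all(c.base is None for c in copies))   # copies own their data
```

Scaled to 5000 arrays per file, the views would pin about 25.3 GB per file under the float32 assumption, which is consistent with the jump in `used` shown in the question's output, while the copies would hold only about 0.6 GB per file. Note that start:end with 10 and 20 selects 10 columns (indices 10 to 19).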

8 Accepted 2018-05-06 00:50:39
