
Efficient Concatenation of Large Numpy Arrays

I'm running a process that creates a very large number of feature vectors (as numpy arrays) and stacks them into a single array. This process is currently very memory-intensive and I'm looking for a more memory-efficient way to run it.

Currently I generate feature vectors in batches of 100,000 and concatenate them together.

import gc

import numpy as np

all_features = None

for i in range(0, num_entries, 100000):
    features = get_features(entries[i:i+100000]) # generate a batch of 100,000 feature vectors
    features = np.array(features)

    if all_features is not None:
        all_features = np.concatenate([all_features, features])
    else:
        all_features = features

    del features
    gc.collect()

I've found that iteratively concatenating feature vectors and then deleting the intermediate features object is more memory-efficient than generating all of the features at once and concatenating them all at once. I believe this is because np.concatenate allocates a new object in memory. (Trying to generate all feature vectors at once and then concatenating them blows up memory.)
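A quick sanity check illustrating this: np.concatenate returns a freshly allocated array rather than a view of its inputs.

import numpy as np

a = np.arange(5)
b = np.concatenate([a, a])

print(b.base is None)            # True: b owns a fresh block of memory
print(np.shares_memory(a, b))    # False: nothing is shared with a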

That said, it gets to a point where running the concatenation near the end of the loop still requires about 30 GB of memory (which is immediately freed after the concatenation is run).

Basically I have enough memory on my instance to store the full feature set, but the memory spikes from packing everything into a single array make me run out of memory.

Is there a more memory-efficient way of running this?

If the total size of all_features is known, I'd suggest allocating it in advance with all_features = np.zeros(...) and then populating it in the loop. That way you get rid of the repeated reallocations, deletions, and np.concatenate() calls.
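A minimal sketch of that approach, assuming a fixed feature width (feature_dim below is a hypothetical name for it; get_features, entries, and num_entries are as in the question):

import numpy as np

# Pre-allocate the full output array once; num_entries and feature_dim
# must be known up front for this to work.
all_features = np.zeros((num_entries, feature_dim), dtype=np.float64)

for i in range(0, num_entries, 100000):
    batch = np.asarray(get_features(entries[i:i+100000]))
    # Write the batch directly into its slice of the pre-allocated array;
    # no intermediate concatenation copy is ever created.
    all_features[i:i+len(batch)] = batch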

Make your get_features function a generator and then use np.fromiter to create the array.

Simple Example

import numpy as np

def gen_values():
    for i in range(1000000): 
        yield i

a = np.fromiter(gen_values(), dtype=int)

You need to specify the dtype in np.fromiter, and you can optionally specify the number of elements to get from the generator with count. While it is optional, it is much better to specify count so that numpy can pre-allocate the output array instead of resizing it on demand.
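For example, with the generator above (the count of 1000000 matches its length):

a = np.fromiter(gen_values(), dtype=int, count=1000000)

Note that np.fromiter builds a 1-D array of scalars by default; to build an array of feature vectors this way you would need a subarray dtype such as np.dtype((np.float64, feature_dim)) (feature_dim being the known feature width), which np.fromiter accepts on NumPy 1.23 and later (an assumption about your NumPy version).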
