
How to combine two huge numpy arrays without concat, stack, or append?

I have two numpy arrays of huge size. Each array has the shape of (7, 960000, 200). I want to concatenate them using np.concatenate((arr1, arr2), axis=1) so that the final shape would be (7, 1920000, 200). The problem is, they already fill up my RAM, and there is not enough room left to do the concatenation, so the process gets killed. The same happens with np.stack. So, I thought of making a new array that points to the two arrays in order; this new array should have the same effect as combining the arrays, and it should be contiguous as well.

So, how can I do this? And is there a better way to combine them than the idea I suggested?

numpy.memmap() allows you to create memory-mapped data stored as a binary file on disk that can be accessed and manipulated as if it were a single in-memory array. This solution saves the individual arrays you are working with as separate .npy files and then combines them into a single binary file.

import numpy as np
import os

size = (7,960000,200)

# We are assuming arrays a and b share the same shape, if they do not 
# see https://stackoverflow.com/questions/50746704/how-to-merge-very-large-numpy-arrays
# for an explanation on how to create the new shape

a = np.ones(size) # ~11 GB of RAM at float64 (7 * 960000 * 200 * 8 bytes)
a = np.transpose(a, (1,0,2))
# Tuples are immutable, so build the doubled shape as a new tuple
shape = (2 * a.shape[0],) + a.shape[1:]
dtype = a.dtype

np.save('a.npy', a)
a = None # allows for data to be deallocated by garbage collector

b = np.ones(size) # ~11 GB of RAM at float64
b = np.transpose(b, (1,0,2))
np.save('b.npy', b)
b = None

# Once the shape is known, create the memmap and write the chunks
data_files = ['a.npy', 'b.npy']
merged = np.memmap('merged.dat', dtype=dtype, mode='w+', shape=shape)
i = 0
for file in data_files:
    chunk = np.load(file)  # plain numeric arrays do not need allow_pickle
    merged[i:i+len(chunk)] = chunk
    i += len(chunk)

merged.flush()  # make sure all written chunks reach the disk
merged = np.transpose(merged, (1,0,2))  # view with the final shape (7, 1920000, 200)

# Delete temporary numpy .npy files
os.remove('a.npy')
os.remove('b.npy')
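If even a single source array at a time is too large to load comfortably, a variation of the above streams the saved arrays through np.load(..., mmap_mode='r') so that neither source is ever fully resident in RAM. This is only a sketch: the shapes are deliberately tiny for illustration, and the step size is a hypothetical knob you would tune to your memory budget.

```python
import numpy as np
import os

size = (7, 100, 5)   # tiny illustrative shape; use (7, 960000, 200) in practice
step = 32            # rows copied per iteration; tune to your memory budget

# Save the two (transposed) source arrays, then free them
a = np.transpose(np.ones(size), (1, 0, 2))
b = np.transpose(np.zeros(size), (1, 0, 2))
np.save('a.npy', a)
np.save('b.npy', b)
del a, b

shape = (2 * size[1], size[0], size[2])
merged = np.memmap('merged.dat', dtype=np.float64, mode='w+', shape=shape)

offset = 0
for name in ('a.npy', 'b.npy'):
    src = np.load(name, mmap_mode='r')  # memory-mapped view, not loaded into RAM
    for start in range(0, len(src), step):
        stop = min(start + step, len(src))
        merged[offset + start:offset + stop] = src[start:stop]
    offset += len(src)
merged.flush()

combined = np.transpose(merged, (1, 0, 2))  # view with shape (7, 200, 5)

# Remove the temporary .npy files; merged.dat stays as the backing store
os.remove('a.npy')
os.remove('b.npy')
```

Because both sides of the copy are memory-mapped, only about `step` rows are ever materialized at once, at the cost of extra disk I/O.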
