
How to write ndarray to .npy file iteratively with batches

I am generating a large dataset for a machine learning application, which is a NumPy array with shape (N, X, Y). Here N is the number of samples, X is the input of a sample and Y is the target of a sample. I want to save this array in the .npy format. I have many samples (N is very large), so the final dataset is 10+ GB. This means that I cannot create the whole dataset and then save it, as it would not fit in memory.

Is it possible instead to write batches of n samples iteratively to this file? For example, I want to append batches of 256 samples, each of shape (256, X, Y), to the file one at a time.

I figured out that it is possible using ndarray.tofile and np.fromfile. Note that the code below still assumes you have the whole array in memory, but you can of course change it so that the batches are generated dynamically.

import numpy as np

N = 1000
X = 10
Y = 1
my_data = np.random.random((N, X, Y))
print(my_data[700, :, :])

batch_size = 10

# Append each batch to the file as raw bytes (no header is written).
with open('test.dat', mode='wb') as f:
    for i in range(0, N, batch_size):
        batch = my_data[i:i + batch_size, :, :]
        batch.tofile(f)

# The raw file stores no dtype or shape, so supply both when reading it back.
x = np.fromfile('test.dat', dtype=my_data.dtype)
x = np.reshape(x, (N, X, Y))
print(x[700, :, :])
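
To make the "generated dynamically" part concrete, here is a minimal sketch in which only one batch exists in memory at a time. The generator `generate_batches`, the seed, and the temporary file path are hypothetical stand-ins for a real data source; reading back through np.memmap is one option for avoiding loading the whole file, and it needs the dtype and shape supplied by hand.

```python
import os
import tempfile

import numpy as np


def generate_batches(n_samples, x, y, batch_size, seed=0):
    """Yield float64 arrays of shape (<= batch_size, x, y), one at a time.

    Hypothetical data source: replace the random numbers with real samples.
    """
    rng = np.random.default_rng(seed)
    for start in range(0, n_samples, batch_size):
        count = min(batch_size, n_samples - start)  # last batch may be smaller
        yield rng.random((count, x, y))


N, X, Y, batch_size = 1000, 10, 1, 256
path = os.path.join(tempfile.mkdtemp(), "test.dat")

written = 0
with open(path, "wb") as f:
    for batch in generate_batches(N, X, Y, batch_size):
        batch.tofile(f)        # appends the raw bytes of this batch
        written += len(batch)  # track the number of samples on disk

# The raw file has no header, so dtype and shape must be given explicitly.
# np.memmap maps the file instead of reading all of it into memory.
data = np.memmap(path, dtype=np.float64, mode="r", shape=(written, X, Y))
print(data.shape)
```

Keeping a running count of written samples means N does not have to be known before writing starts, but the dtype and the trailing dimensions still have to be recorded somewhere outside the file.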

As @hpaulj mentioned, this file cannot be loaded with np.load. (tofile writes only the raw data, without the header that the .npy format requires.)
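
If the total number of samples N is known before writing starts, another option (not part of the answers here) is np.lib.format.open_memmap, which creates a .npy file with a valid header and lets you fill it slice by slice. The sketch below assumes float64 data; the file name and the random batches are placeholders.

```python
import os
import tempfile

import numpy as np

N, X, Y, batch_size = 1000, 10, 1, 256
path = os.path.join(tempfile.mkdtemp(), "dataset.npy")

# Create the .npy file with its header and map the data region for writing.
out = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float64, shape=(N, X, Y)
)

rng = np.random.default_rng(0)
for start in range(0, N, batch_size):
    stop = min(start + batch_size, N)
    # Stand-in for real batch generation; only this batch is held in memory.
    out[start:stop] = rng.random((stop - start, X, Y))

out.flush()  # push pending changes to disk
del out      # drop the memory map

# The result is a regular .npy file; mmap_mode="r" avoids reading it all at once.
loaded = np.load(path, mmap_mode="r")
print(loaded.shape)
```

The trade-off is that the final shape must be fixed up front, so this does not help when the number of samples is only known after generation finishes.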

Here is a solution, based on NumPy's implementation of save, that writes a standard .npy file including shape and type information:

import numpy as np
import numpy.lib as npl

a = np.random.random((30, 3, 2))
a1 = a[:10]
a2 = a[10:]

filename = 'out.npy'
with open(filename, 'wb+') as f:
    # Write a header describing the first batch, then the raw data of both batches.
    header = npl.format.header_data_from_array_1_0(a1)
    npl.format.write_array_header_1_0(f, header)
    a1.tofile(f)
    a2.tofile(f)
    # Go back to the start and overwrite the header with the final shape.
    f.seek(0)
    header['shape'] = (len(a1) + len(a2), *header['shape'][1:])
    npl.format.write_array_header_1_0(f, header)

assert (np.load(filename) == a).all()

This works for C_CONTIGUOUS arrays without Python objects.
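
The same idea can be wrapped in a function that accepts any iterable of batches. This is a sketch under assumptions, not NumPy's own API: the name `save_batches_as_npy` is made up, every batch is assumed to share the same dtype and trailing dimensions, and the function checks that the rewritten header has the same length as the first one, since a longer header would overwrite the start of the data.

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format


def save_batches_as_npy(path, batches):
    """Write an iterable of arrays (stacked along axis 0) to one .npy file.

    Hypothetical helper; returns the total number of samples written.
    """
    total = 0
    header = None
    with open(path, "wb") as f:
        for batch in batches:
            # C-contiguous input keeps fortran_order False in the header,
            # which matches the C-order bytes that tofile writes.
            batch = np.ascontiguousarray(batch)
            if batch.dtype.hasobject:
                raise TypeError("object arrays cannot be written with tofile")
            if header is None:
                header = npy_format.header_data_from_array_1_0(batch)
                dtype, tail = batch.dtype, batch.shape[1:]
                npy_format.write_array_header_1_0(f, header)
                data_start = f.tell()  # byte offset where the data begins
            elif batch.dtype != dtype or batch.shape[1:] != tail:
                raise ValueError("all batches must share dtype and trailing shape")
            batch.tofile(f)
            total += len(batch)
        if header is None:
            raise ValueError("no batches were provided")
        # Overwrite the header in place with the final shape.
        f.seek(0)
        header["shape"] = (total, *tail)
        npy_format.write_array_header_1_0(f, header)
        if f.tell() != data_start:
            raise RuntimeError("header length changed; the file is not valid")
    return total


a = np.random.default_rng(0).random((30, 3, 2))
path = os.path.join(tempfile.mkdtemp(), "out.npy")
total = save_batches_as_npy(path, [a[:10], a[10:20], a[20:]])
print(total, np.load(path).shape)
```

The length check matters because the header is written before the final sample count is known; if the count gains enough digits to push the header past its padding, the rewrite no longer fits in the space reserved by the first write.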


 