How do I train a neural network in Keras on data stored in HDF5 files?
I have two fairly large PyTables EArrays which contain the inputs and labels for a regression task. The input array is 4d (55k x 128 x 128 x 3) and the label array is 1d (55k). I have a NN architecture specified in Keras which I want to train on this data, but there are two problems.
How can I select subsets of the HDF5 arrays (input and output) according to train/test indices and train on the training subsets, without reading them into memory all at once? Is there some way to create a "view" of the on-disk array that can be sliced and that Keras will see as a regular NumPy ndarray?
What I've tried so far is to convert my arrays to Keras HDF5Matrix objects (with e.g. X = keras.utils.io_utils.HDF5Matrix(X)), but when I then slice this to get a training split, the full slice (80% of the full array) gets put into memory, which gives me a MemoryError.
You can use the fit_generator method of your Keras model. Just write your own generator class/function that pulls random batches of samples from your HDF5 file. That way, you never have to have all the data in memory at once. Similarly, if your validation data are too large to fit in memory, the validation_data argument to fit_generator also accepts a generator that produces batches from your validation data.
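A minimal sketch of such a generator, assuming the hypothetical name hdf5_batch_generator and that X and y support NumPy-style fancy indexing (PyTables EArrays and h5py datasets do, though both prefer or require the index list to be sorted, hence the np.sort):

```python
import numpy as np

def hdf5_batch_generator(X, y, indices, batch_size):
    """Yield (inputs, labels) batches forever, loading only
    batch_size rows of the on-disk arrays at a time.

    X and y can be PyTables EArrays, h5py datasets, or plain
    NumPy arrays -- anything sliceable by an index array.
    """
    while True:  # Keras generators must loop indefinitely
        for start in range(0, len(indices), batch_size):
            # Sort the batch indices: HDF5-backed fancy indexing
            # generally requires increasing coordinates.
            batch_idx = np.sort(indices[start:start + batch_size])
            yield X[batch_idx], y[batch_idx]
```

Because only batch_idx rows are ever materialized, memory use stays at one batch regardless of the size of the file.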
Essentially, you just need to do an np.random.shuffle on an array of indices into your data set, then split the random index array into training, validation, and testing index arrays. Your generator arguments to fit_generator will just pull batches from your HDF5 file according to sequential batches of indices in the training and validation index arrays.
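The shuffle-and-split step might look like this, using the 55k sample count from the question and an assumed 80/10/10 split; the commented fit_generator call shows how the splits would be wired up, assuming a batch generator with a hypothetical signature hdf5_batch_generator(X, y, indices, batch_size) along with an existing model, X, and y:

```python
import numpy as np

n_samples = 55000  # total rows in the EArrays, from the question
rng = np.random.RandomState(0)  # seeded for a reproducible split

# Shuffle all row indices once, then carve out the three splits.
indices = np.arange(n_samples)
rng.shuffle(indices)

n_train = int(0.8 * n_samples)
n_val = int(0.1 * n_samples)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

# Hypothetical wiring into Keras:
# batch_size = 32
# model.fit_generator(
#     hdf5_batch_generator(X, y, train_idx, batch_size),
#     steps_per_epoch=len(train_idx) // batch_size,
#     epochs=10,
#     validation_data=hdf5_batch_generator(X, y, val_idx, batch_size),
#     validation_steps=len(val_idx) // batch_size,
# )
```

Shuffling the index array once up front keeps the HDF5 reads themselves sequential within each batch while still giving you randomized splits.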