
How to read batches in one hdf5 data file for training?

I have an hdf5 training dataset with shape (21760, 1, 33, 33), where 21760 is the total number of training samples. I want to train the network with mini-batches of size 128.

I want to ask:

How can I feed a mini-batch of 128 training samples from the whole dataset with TensorFlow each time?

If your data set is so large that it can't be imported into memory as keveman suggested, you can use the h5py object directly:

import h5py
import tensorflow as tf

data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128

sess = tf.Session()
train_op = ...  # tf.something_useful(), e.g. an optimizer's minimize op
input = ...     # tf.placeholder matching the batch shape

# Walk through the file in contiguous slices of batch_size samples;
# h5py only reads each slice from disk when it is indexed.
for i in range(0, data_size, batch_size):
    current_data = data['data_set'][i:i + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

If you prefer, you can also run through a large number of iterations and randomly select a batch each time:

import random

for i in range(iterations):
    # Pick a random batch-aligned offset into the dataset.
    pos = random.randint(0, data_size // batch_size - 1) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

Or sequentially:

for i in range(iterations):
    # Cycle through the dataset in order, wrapping around at the end.
    pos = (i % (data_size // batch_size)) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

You probably want to write somewhat more sophisticated code that goes through all the data randomly but keeps track of which batches have already been used, so that no batch is used more often than the others. Once you have done a full pass through the training set, you enable all batches again and repeat; a sketch of this follows.
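A minimal sketch of that idea, reusing the data, data_size, batch_size, sess, train_op and input names from the snippet above, plus a hypothetical num_epochs: shuffle the order of the batches once per epoch so that every batch is used exactly once before any batch repeats.

import numpy as np

num_batches = data_size // batch_size
for epoch in range(num_epochs):
    # New random batch order each epoch; every batch index appears exactly once.
    batch_order = np.random.permutation(num_batches)
    for b in batch_order:
        pos = b * batch_size
        # Slices stay contiguous, so the h5py reads remain efficient.
        current_data = data['data_set'][pos:pos + batch_size]
        sess.run(train_op, feed_dict={input: current_data})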

You can read the hdf5 dataset into a numpy array and feed slices of the numpy array to the TensorFlow model. Pseudo-code like the following would work:

import numpy
import h5py

f = h5py.File('somefile.h5', 'r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)   # loads the whole dataset into memory

# Feed 128-sample slices of the in-memory array to the model.
for i in range(0, 21760, 128):
    sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})

alkamen's approach seems logically right, but I have not gotten any positive results using it. My best guess is this: with code sample 1 above, the network trains afresh in every iteration, forgetting everything learned in the previous loop. So if we fetch 30 samples or batches per iteration, only those 30 samples are used in each loop, and at the next loop everything is overwritten.

Below is a screenshot of this approach:

[Screenshot: training always starts over from scratch]

As can be seen, the loss and accuracy always start afresh. I would be happy if anyone could share a possible way around this.
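For reference, a minimal sketch of the structure that commonly produces this symptom, assuming a TF 1.x graph and the data, data_size, batch_size, train_op and input names from the earlier snippets: if the variable initializer (or a fresh session/graph build) ends up inside the batch loop, the weights are reset every iteration; running it once before the loop preserves progress across batches.

import tensorflow as tf

# ... build the graph here: input placeholder, loss, train_op ...

sess = tf.Session()
sess.run(tf.global_variables_initializer())  # run ONCE, before the training loop

for i in range(0, data_size, batch_size):
    current_data = data['data_set'][i:i + batch_size]
    # Only the training op runs inside the loop. If the initializer (or a new
    # Session) were created in here instead, the weights would be reset each
    # iteration and loss/accuracy would keep starting over.
    sess.run(train_op, feed_dict={input: current_data})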
