
How to read batches from one HDF5 data file for training?

I have an HDF5 training dataset of shape (21760, 1, 33, 33), where 21760 is the total number of training samples. I want to train the network with mini-batches of size 128.

I want to ask:

How can I feed a mini-batch of 128 training samples from the whole dataset to TensorFlow each time?

If your dataset is so large that it can't be loaded into memory the way keveman suggested, you can use the h5py object directly:

import h5py
import tensorflow as tf

data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128

sess = tf.Session()
train_op = ...  # tf.something_useful(), e.g. optimizer.minimize(loss)
input = ...     # tf.placeholder for one batch, e.g. shape (None, 1, 33, 33)

# Walk over the dataset in contiguous slices; only one slice is read from disk per step.
for i in range(0, data_size, batch_size):
    current_data = data['data_set'][i:i + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

You can also run through a huge number of iterations and randomly select a batch if you want to:

import random

iterations = 10000  # however many training steps you want
for i in range(iterations):
    # Pick a random batch start that is a multiple of batch_size.
    pos = random.randint(0, data_size // batch_size - 1) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

Or sequentially:

for i in range(iterations):
    # Cycle through the batches in order, wrapping around at the end.
    pos = (i % (data_size // batch_size)) * batch_size
    current_data = data['data_set'][pos:pos + batch_size]
    sess.run(train_op, feed_dict={input: current_data})

You probably want to write some more sophisticated code that goes through all the data randomly but keeps track of which batches have already been used, so that no batch is used more often than the others. Once you've done a full pass through the training set, you enable all batches again and repeat.
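A minimal sketch of that bookkeeping, reusing the data, data_size, batch_size, sess, train_op and input names from the first snippet (num_epochs is a hypothetical parameter, not from the original answer): shuffle the batch start offsets once per epoch, so every batch is visited exactly once before any batch is repeated, while each HDF5 read stays a fast contiguous slice.

import numpy as np

num_batches = data_size // batch_size
num_epochs = 10  # hypothetical: how many full passes over the training set you want

for epoch in range(num_epochs):
    # Shuffle which batch comes next, but keep each batch itself a contiguous slice.
    starts = np.random.permutation(num_batches) * batch_size
    for pos in starts:
        current_data = data['data_set'][pos:pos + batch_size]
        sess.run(train_op, feed_dict={input: current_data})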

You can read the HDF5 dataset into a NumPy array and feed slices of that array to the TensorFlow model. Pseudocode like the following would work:

import numpy, h5py

f = h5py.File('somefile.h5', 'r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)  # loads the entire dataset into memory
for i in range(0, 21760, 128):
    # Feed one 128-sample slice of the in-memory array per training step.
    sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})

alkamen's approach seems logically right, but I have not gotten any positive results with it. My best guess is this: using code sample 1 above, the network trains afresh in every iteration, forgetting everything learned in the previous loop. So if we fetch 30 samples or batches per iteration, only those 30 samples are used in that loop, and then everything is overwritten in the next one.

Below is a screenshot of this approach:

(Screenshot: training always restarts from scratch)

As can be seen, the loss and accuracy always start over from the beginning. I would be glad if anyone could share a possible way around this.
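One possible explanation, offered as a guess rather than something stated in the answers above: if the graph, the session, or the variable initializer is re-created inside the batch loop, the weights really are reset on every batch, which would produce exactly the restarting curves in the screenshot. Below is a minimal sketch with a toy linear model (the model, loss, learning rate, and the myfile.h5py / data_set names are placeholders borrowed from the earlier snippet, not the original poster's network) that builds the graph and runs the initializer once, so every sess.run(train_op, ...) keeps updating the same weights.

import h5py
import tensorflow as tf

data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128

# Build the graph ONCE, outside any loop.
x = tf.placeholder(tf.float32, shape=(None, 1, 33, 33))
flat = tf.reshape(x, [-1, 33 * 33])
w = tf.Variable(tf.random_normal([33 * 33, 1], stddev=0.01))
pred = tf.matmul(flat, w)
loss = tf.reduce_mean(tf.square(pred))  # toy loss, just to make the example runnable
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # run the initializer ONCE
    for i in range(0, data_size, batch_size):
        batch = data['data_set'][i:i + batch_size]
        # Each run updates the same weight variable in place;
        # nothing is rebuilt or re-initialized inside the loop.
        sess.run(train_op, feed_dict={x: batch})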
