Large HDF5 dataset: how to shuffle efficiently after each epoch

Question

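I am currently using a large image dataset (~60GB) to train a CNN (Keras / Tensorflow) for a simple classification task. The images are video frames and therefore highly correlated in time, so I already shuffled the data once when generating the huge .hdf5 file. To feed the data to the CNN without loading the whole set into memory at once, I wrote a simple batch generator (see the code below).

Now my question: it is usually recommended to shuffle the data after each training epoch, right? (For SGD convergence reasons?) But to do that I would have to load the whole dataset after each epoch and shuffle it, which is exactly what I wanted to avoid by using the batch generator. So: is it really that important to shuffle the dataset after each epoch, and if yes, how can I do it as efficiently as possible?

Here is the current code of my batch generator: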

import numpy as np
from keras.utils import to_categorical


def generate_batches_from_hdf5_file(hdf5_file, batch_size, dimensions, num_classes):
    """
    Generator that returns batches of images ('xs') and labels ('ys') from a h5 file.
    """
    filesize = len(hdf5_file['labels'])

    while True:
        # count how many entries we have read in this epoch
        n_entries = 0
        # keep reading as long as a full batch is left in the file
        # (a trailing partial batch is skipped; this assumes `dimensions`
        # includes the batch size, so the reshape below needs full batches)
        while n_entries + batch_size <= filesize:
            # create numpy arrays of input data (features)
            xs = hdf5_file['images'][n_entries:n_entries + batch_size]
            xs = np.reshape(xs, dimensions).astype('float32')

            # and label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color,...
            y_values = hdf5_file['labels'][n_entries:n_entries + batch_size]
            ys = to_categorical(y_values, num_classes)

            # we have read one more batch from this file
            n_entries += batch_size
            yield (xs, ys)

Answer 1

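Yes, shuffling improves performance, because running through the data in the same order every epoch can get you stuck in suboptimal regions.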

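Don't shuffle the entire data. Create a list of indices into the data and shuffle that instead. Then move sequentially over the index list and use its values to pick data from the dataset.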
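
A minimal sketch of that idea, in case it helps. The function names (`shuffled_index_batches`, `generate_shuffled_batches`) and the `seed` parameter are made up for illustration and are not part of the question's code. It assumes `images` and `labels` support NumPy-style integer-array indexing along the first axis, as h5py datasets and NumPy arrays do. The indices inside each batch are sorted because h5py expects index lists in increasing order; the batch composition is still random.

```python
import numpy as np


def shuffled_index_batches(n_samples, batch_size, rng):
    """Yield index arrays that together cover one epoch in random order."""
    order = rng.permutation(n_samples)  # shuffle the indices, not the data
    for start in range(0, n_samples, batch_size):
        # Sort within the batch: h5py fancy indexing needs increasing indices.
        yield np.sort(order[start:start + batch_size])


def generate_shuffled_batches(images, labels, batch_size, seed=None):
    """Endless generator that draws a fresh permutation at every epoch.

    `images` and `labels` may be h5py datasets or NumPy arrays
    (hypothetical helper, not the asker's original generator).
    """
    rng = np.random.default_rng(seed)
    n_samples = len(labels)
    while True:
        for idx in shuffled_index_batches(n_samples, batch_size, rng):
            xs = np.asarray(images[idx], dtype="float32")
            ys = np.asarray(labels[idx])
            yield xs, ys
```

With an open file this would be called as something like `generate_shuffled_batches(hdf5_file['images'], hdf5_file['labels'], batch_size)`. Only the index array lives in memory, but the reads are scattered rather than contiguous, so they will likely be slower than slicing. If that becomes the bottleneck, shuffling the order of the batch start offsets and still reading contiguous slices is a cheaper compromise, given that the file was already shuffled once when it was written.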

Answer 1: score 2, accepted, 2018-02-10 08:09:53
