Big HDF5 dataset: how to shuffle efficiently after each epoch

Question

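I'm currently working with a big image dataset (~60GB) to train a CNN (Keras / Tensorflow) for a simple classification task. The images are video frames and therefore highly correlated in time, so I already shuffled the data once when generating the huge .hdf5 file. To feed the data to the CNN without having to load the whole set into memory at once, I wrote a simple batch generator (see the code below).

Now my question: it is usually recommended to shuffle the data after each training epoch, right? (For SGD convergence reasons?) But to do that I would have to load the whole dataset after each epoch and shuffle it, which is exactly what I wanted to avoid by using the batch generator.

So: is it really that important to shuffle the dataset after each epoch, and if so, how can I do it as efficiently as possible? Here is the current code of my batch generator: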

import numpy as np
from keras.utils import to_categorical


def generate_batches_from_hdf5_file(hdf5_file, batch_size, dimensions, num_classes):
    """
    Generator that returns batches of images ('xs') and labels ('ys') from a h5 file.
    """
    filesize = len(hdf5_file['labels'])

    while True:
        # count how many entries we have read in the current epoch
        n_entries = 0
        # as long as a full batch is left in the file: keep reading
        # (the original test, n_entries < filesize - batch_size, skipped the
        # last full batch whenever filesize was a multiple of batch_size)
        while n_entries + batch_size <= filesize:
            # create numpy arrays of input data (features)
            xs = hdf5_file['images'][n_entries:n_entries + batch_size]
            xs = np.reshape(xs, dimensions).astype('float32')

            # and label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color,...
            y_values = hdf5_file['labels'][n_entries:n_entries + batch_size]
            ys = to_categorical(y_values, num_classes)

            # we have read one more batch from this file
            n_entries += batch_size
            yield (xs, ys)

Answer 1

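Yes, shuffling improves performance, because running through the data in the same order every epoch can get you stuck in suboptimal regions.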

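Don't shuffle the data itself. Create a list of indices into the data and shuffle that instead. Then walk through the index list sequentially and use its values to select the samples from the dataset.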
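The index-shuffling idea could look roughly like the sketch below. It is not the asker's generator. The function name `shuffled_batches` and its `rng` parameter are made up for illustration, and it assumes `images` and `labels` are h5py datasets (or NumPy arrays) of equal length. One extra assumption is worth flagging: h5py expects index lists in increasing order, so the indices are sorted inside each batch. The batches still differ from epoch to epoch; only the order within a batch is fixed.

```python
import numpy as np


def shuffled_batches(images, labels, batch_size, rng=None):
    """Yield (xs, ys) batches forever, in a new random sample order each epoch.

    Hypothetical helper: `images` and `labels` may be h5py datasets or
    NumPy arrays of equal length.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_samples = len(labels)
    indices = np.arange(n_samples)

    while True:
        # Shuffle only the small index array, never the data on disk.
        rng.shuffle(indices)
        # Step through the shuffled indices one full batch at a time.
        for start in range(0, n_samples - batch_size + 1, batch_size):
            # Sort within the batch because h5py wants increasing indices.
            batch_idx = np.sort(indices[start:start + batch_size]).tolist()
            xs = np.asarray(images[batch_idx], dtype='float32')
            ys = np.asarray(labels[batch_idx])
            yield xs, ys
```

Scattered reads from an HDF5 file are typically slower than the contiguous slices used in the question, so it may be worth timing this against a cheaper variant that shuffles only the order of contiguous batches.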

Answer 1 (2, accepted, 2018-02-10 08:09:53)