How to use tf.data.Dataset.from_generator() to load only one batch at a time from a dataset?

Question

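I want to train a CNN, and I am trying to feed the model one batch at a time, directly from a numpy memmap, using tf.data.Dataset.from_generator(), so that the whole dataset never has to be loaded into memory. I am using TF 2.2 and fitting on a GPU. The dataset is a sequence of 3D matrices (NCHW format), and the label of each case is the next 3D matrix. The problem is that the whole dataset still gets loaded into memory.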

Here is a short reproducible example:

import numpy as np
from numpy.lib.format import open_memmap
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.config.list_physical_devices("GPU")


# create and initialize the memmap
ds_shape = (20000, 3, 50, 50)
ds_mmap = open_memmap("ds.npy",
                      mode='w+',
                      dtype=np.dtype("float64"),
                      shape=ds_shape)
# fill in place; rebinding the name (ds_mmap = ...) would swap the memmap for an in-RAM array
ds_mmap[:] = np.random.rand(*ds_shape)

len_ds = len(ds_mmap)          # 20000
len_train = int(0.6 * len_ds)  # 12000
len_val = int(0.2 * len_ds)    # 4000
len_test = int(0.2 * len_ds)   # 4000
batch_size = 32
epochs = 50
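
As an aside on initialization: a memmap can also be filled chunk by chunk with slice assignment, so that the full array never has to exist in RAM at once. The following is a minimal sketch under that assumption; fill_memmap_in_chunks, chunk, and seed are hypothetical names, and the demo uses a much smaller shape than the real dataset.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def fill_memmap_in_chunks(path, shape, chunk=1000, seed=0):
    """Create a .npy memmap and fill it with random values, one chunk at a time."""
    rng = np.random.default_rng(seed)
    mmap = open_memmap(path, mode="w+", dtype=np.float64, shape=shape)
    for start in range(0, shape[0], chunk):
        stop = min(start + chunk, shape[0])
        # Slice assignment writes into the file-backed array in place;
        # only one chunk of random data is held in RAM at a time.
        mmap[start:stop] = rng.random((stop - start,) + tuple(shape[1:]))
    mmap.flush()
    return mmap


# Small demo shape (an assumption) so the sketch runs quickly.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npy")
demo = fill_memmap_in_chunks(demo_path, (100, 3, 5, 5), chunk=32)
```

The returned object is still a np.memmap, so indexing it later reads from the file rather than from a fully loaded array.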

I tried 2 ways of generating the train-val-test datasets (also, if anyone could comment on the pros and cons, that would be very welcome):

1.

def gen(ds_mmap, start, stop):
  for i in range(start, stop):
    yield (ds_mmap[i], ds_mmap[i + 1])

tvt = {"train": None, "val": None, "test": None}
tvt_limits = {
  "train": (0, len_train),
  "val": (len_train, len_train + len_val),
  "test": (len_train + len_val, len_ds -1)  # -1 because the last case does not have a label
}

for ds_type in tvt:
  start, stop = tvt_limits[ds_type]
  # store the dataset in the dict; assigning to the loop variable would leave tvt unchanged
  tvt[ds_type] = tf.data.Dataset.from_generator(
    generator=gen,
    output_types=(tf.float64, tf.float64),
    output_shapes=(ds_shape[1:], ds_shape[1:]),
    args=[ds_mmap, start, stop]
  )

train_ds = (
  tvt["train"]
  .shuffle(len_ds, reshuffle_each_iteration=False)
  .batch(batch_size)
)
val_ds = tvt["val"].batch(batch_size)
test_ds = tvt["test"].batch(batch_size)

2.

def gen(ds_mmap):
  for i in range(len(ds_mmap) - 1):
    yield (ds_mmap[i], ds_mmap[i + 1])

ds = tf.data.Dataset.from_generator(
  generator=gen,
  output_types=(tf.float64, tf.float64),
  output_shapes=(ds_shape[1:], ds_shape[1:]),
  args=[ds_mmap]
)

train_ds = (
  ds
  .take(len_train)
  .shuffle(len_ds, reshuffle_each_iteration=False)
  .batch(batch_size)
)
val_ds = ds.skip(len_train).take(len_val).batch(batch_size)
test_ds = ds.skip(len_train + len_val).take(len_test - 1).batch(batch_size)

Both ways work, but they bring the whole dataset into memory.

model = keras.Sequential([
  layers.Conv2D(64, (3, 3), input_shape=ds_shape[1:],
                activation="relu", data_format="channels_first"),
  layers.MaxPooling2D(data_format="channels_first"),
  layers.Conv2D(128, (3, 3),
                activation="relu", data_format="channels_first"),
  layers.MaxPooling2D(data_format="channels_first"),
  layers.Flatten(),
  layers.Dense(8182, activation="relu"),
  layers.Dense(np.prod(ds_shape[1:])),
  layers.Reshape(ds_shape[1:])
])

model.compile(loss="mean_absolute_error",
              optimizer="adam",
              metrics=[tf.keras.metrics.MeanSquaredError()])

hist = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs,
  # steps_per_epoch=len_train // batch_size,
  # validation_steps=len_val // batch_size,
  shuffle=True
)

Answer 1

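An alternative is to subclass keras.utils.Sequence. The idea is to generate whole batches.

Quoting the documentation:

Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

To do so, the __len__() and __getitem__() methods need to be provided.

For the current example: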

class DS(keras.utils.Sequence):
  
  def __init__(self, ds_mmap, start, stop, batch_size):
    self.ds = ds_mmap[start: stop]
    self.batch_size = batch_size

  def __len__(self):
    # divide-ceil over the (x, y) pairs; the last sample of the slice has no label
    return -(-(len(self.ds) - 1) // self.batch_size)

  def __getitem__(self, idx):
    start = idx * self.batch_size
    stop = (idx + 1) * self.batch_size
    batch_y = self.ds[start + 1: stop + 1]
    batch_x = self.ds[start: stop][: len(batch_y)]
    return batch_x, batch_y

for ds_type in tvt:
  start, stop = tvt_limits[ds_type]
  tvt[ds_type] = DS(ds_mmap, start, stop, batch_size)
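
The batch indexing can be checked without Keras. The following is a small sketch that mirrors the __getitem__ logic on a toy array; pair_batch and num_batches are hypothetical helper names. A slice of n samples yields n - 1 (x, y) pairs, so the batch count is taken over the pairs, which avoids an empty final batch.

```python
import numpy as np


def pair_batch(ds, idx, batch_size):
    """Return (x, y) for batch `idx`, where y[i] is the sample right after x[i]."""
    start = idx * batch_size
    stop = (idx + 1) * batch_size
    batch_y = ds[start + 1: stop + 1]
    # Trim x so that a short final batch keeps x and y the same length.
    batch_x = ds[start: stop][: len(batch_y)]
    return batch_x, batch_y


def num_batches(n_samples, batch_size):
    """Ceil-divide the number of (x, y) pairs, which is n_samples - 1."""
    return -(-(n_samples - 1) // batch_size)


toy = np.arange(10)            # stand-in for a slice of the memmap
n = num_batches(len(toy), 4)   # 9 pairs -> 3 batches
batches = [pair_batch(toy, i, 4) for i in range(n)]
```

Each y in batches is its x shifted by one sample, and the last batch holds the single remaining pair.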

In this case, the number of steps has to be defined explicitly, and batch_size must not be passed:

hist = model.fit(
  tvt["train"],
  validation_data=tvt["val"],
  epochs=epochs,
  steps_per_epoch=len_train // batch_size,
  validation_steps=len_val // batch_size,
  shuffle=True
)

I did not get from_generator() to work, though, and I would like to know how.
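
One possible explanation, offered as a sketch and not as a confirmed fix: the from_generator() documentation describes args as tf.Tensor objects that are evaluated and passed to the generator as NumPy arrays, so passing ds_mmap through args would convert the whole memmap into a tensor. In addition, shuffle(len_ds) keeps a buffer of up to len_ds samples in memory. The sketch below closes over the memmap and passes only the integer bounds through args. make_pair_generator, make_dataset, and shuffle_buffer are hypothetical names, and the memory behavior has not been measured here.

```python
import numpy as np


def make_pair_generator(ds_mmap):
    """Return a generator function that closes over the memmap."""
    def gen(start, stop):
        # `start` and `stop` arrive as NumPy scalars when passed through `args`.
        for i in range(int(start), int(stop)):
            # Indexing a memmap reads only the requested samples from disk.
            yield ds_mmap[i], ds_mmap[i + 1]
    return gen


def make_dataset(ds_mmap, start, stop, batch_size, shuffle_buffer=None):
    """Build a batched tf.data pipeline over samples start..stop-1 (assumed layout)."""
    # Imported here so the generator part can be exercised without TensorFlow.
    import tensorflow as tf

    sample_shape = ds_mmap.shape[1:]
    ds = tf.data.Dataset.from_generator(
        make_pair_generator(ds_mmap),
        output_types=(tf.float64, tf.float64),
        output_shapes=(sample_shape, sample_shape),
        args=(start, stop),  # only two integers are converted to tensors
    )
    if shuffle_buffer:
        # The buffer holds at most `shuffle_buffer` samples in memory.
        ds = ds.shuffle(shuffle_buffer)
    return ds.batch(batch_size).prefetch(1)
```

For example, make_dataset(ds_mmap, *tvt_limits["train"], batch_size, shuffle_buffer=1024) would build the training dataset; the buffer size 1024 is an arbitrary assumption that trades shuffle quality for memory.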

Solution 1
0 2020-10-30 11:24:12
