How to load only one batch at a time from a dataset using tf.data.Dataset.from_generator()?

Question

I want to train a CNN and feed the model one batch at a time, directly from a numpy memmap, without loading the whole dataset into memory, using tf.data.Dataset.from_generator(). I am using tf2.2 and the GPU for fitting. The dataset is a sequence of 3D matrices (NCHW format). The label of each case is the next 3D matrix. The problem is that it still loads the whole dataset into memory.

Here is a short reproducible example:

import numpy as np
from numpy.lib.format import open_memmap
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.config.list_physical_devices("GPU")


# create and initialize the memmap
ds_shape = (20000, 3, 50, 50)
ds_mmap = open_memmap("ds.npy",
                      mode='w+',
                      dtype=np.dtype("float64"),
                      shape=ds_shape)
# fill in place; plain assignment would rebind ds_mmap to an in-memory array
ds_mmap[:] = np.random.rand(*ds_shape)

len_ds = len(ds_mmap)          # 20000
len_train = int(0.6 * len_ds)  # 12000
len_val = int(0.2 * len_ds)    # 4000
len_test = int(0.2 * len_ds)   # 4000
batch_size = 32
epochs = 50

I tried 2 ways of generating the train-val-test datasets (comments on the pros and cons of each are more than welcome):

1.

def gen(ds_mmap, start, stop):
  for i in range(start, stop):
    yield (ds_mmap[i], ds_mmap[i + 1])

tvt = {"train": None, "val": None, "test": None}
tvt_limits = {
  "train": (0, len_train),
  "val": (len_train, len_train + len_val),
  "test": (len_train + len_val, len_ds -1)  # -1 because the last case does not have a label
}

for ds_type in tvt:
  start, stop = tvt_limits[ds_type]
  # store the dataset in the dict; assigning to the loop variable would discard it
  tvt[ds_type] = tf.data.Dataset.from_generator(
    generator=gen,
    output_types=(tf.float64, tf.float64),
    output_shapes=(ds_shape[1:], ds_shape[1:]),
    args=[ds_mmap, start, stop]
  )

train_ds = (
  tvt["train"]
  .shuffle(len_ds, reshuffle_each_iteration=False)
  .batch(batch_size)
)
val_ds = tvt["val"].batch(batch_size)
test_ds = tvt["test"].batch(batch_size)

2.

def gen(ds_mmap):
  for i in range(len(ds_mmap) - 1):
    yield (ds_mmap[i], ds_mmap[i + 1])

ds = tf.data.Dataset.from_generator(
  generator=gen,
  output_types=(tf.float64, tf.float64),
  output_shapes=(ds_shape[1:], ds_shape[1:]),
  args=[ds_mmap]
)

train_ds = (
  ds
  .take(len_train)
  .shuffle(len_ds, reshuffle_each_iteration=False)
  .batch(batch_size)
)
val_ds = ds.skip(len_train).take(len_val).batch(batch_size)
test_ds = ds.skip(len_train + len_val).take(len_test - 1).batch(batch_size)

Both ways work, but they bring the whole dataset into memory.
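One thing worth checking when memory use is unexpectedly high is whether the array is still a memmap at all. The sketch below (the file name demo.npy and the small shape are made up for illustration) shows that filling a memmap in place keeps it disk-backed, that plain assignment rebinds the name to an ordinary in-memory array, and that np.load with mmap_mode="r" reopens the file lazily.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical small file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
shape = (8, 3, 5, 5)

# In-place fill: the values are written through to the file.
on_disk = open_memmap(path, mode="w+", dtype=np.float64, shape=shape)
on_disk[:] = np.random.rand(*shape)
on_disk.flush()

# Rebinding: the name now refers to a regular array held in RAM.
rebound = on_disk
rebound = np.random.rand(*shape)

# Reopening with mmap_mode="r" maps the file without reading it all.
lazy = np.load(path, mmap_mode="r")
first_rows = np.array(lazy[:2])  # copies only these two rows into RAM

print(isinstance(on_disk, np.memmap))
print(isinstance(rebound, np.memmap))
print(isinstance(lazy, np.memmap))
```

Only the first and third checks report a memmap; the rebound name is a plain ndarray.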

model = keras.Sequential([
  layers.Conv2D(64, (3, 3), input_shape=ds_shape[1:],
                activation="relu", data_format="channels_first"),
  layers.MaxPooling2D(data_format="channels_first"),
  layers.Conv2D(128, (3, 3),
                activation="relu", data_format="channels_first"),
  layers.MaxPooling2D(data_format="channels_first"),
  layers.Flatten(),
  layers.Dense(8182, activation="relu"),
  layers.Dense(np.prod(ds_shape[1:])),
  layers.Reshape(ds_shape[1:])
])

model.compile(loss="mean_absolute_error",
              optimizer="adam",
              metrics=[tf.keras.metrics.MeanSquaredError()])

hist = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs,
  # steps_per_epoch=len_train // batch_size,
  # validation_steps=len_val // batch_size,
  shuffle=True
)

Answer 1

An alternative is to subclass keras.utils.Sequence. The idea is to generate a whole batch at a time.

Quoting the docs:

Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

To do so, the subclass needs to provide __len__() and __getitem__() methods.

For the current example:

class DS(keras.utils.Sequence):
  
  def __init__(self, ds_mmap, start, stop, batch_size):
    self.ds = ds_mmap[start: stop]
    self.batch_size = batch_size

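  def __len__(self):
    # divide-ceil over the (x, next x) pairs; the last row has no label,
    # so counting rows instead could produce an empty final batch
    return -(-(len(self.ds) - 1) // self.batch_size)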

  def __getitem__(self, idx):
    start = idx * self.batch_size
    stop = (idx + 1) * self.batch_size
    batch_y = self.ds[start + 1: stop + 1]
    batch_x = self.ds[start: stop][: len(batch_y)]
    return batch_x, batch_y

for ds_type in tvt:
  start, stop = tvt_limits[ds_type]
  # store the Sequence in the dict so that tvt["train"] etc. are usable below
  tvt[ds_type] = DS(ds_mmap, start, stop, batch_size)
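To make the batch arithmetic easier to check by hand, here is a minimal pure-Python sketch of the index bookkeeping a Sequence-style loader needs for this kind of data, where each row is labelled by the row that follows it. The helper names num_batches and batch_indices are hypothetical and are not part of Keras.

```python
def num_batches(n_rows, batch_size):
    """Number of batches needed to cover every (row, next row) pair."""
    n_pairs = n_rows - 1  # the last row has no label
    return -(-n_pairs // batch_size)  # ceiling division


def batch_indices(n_rows, batch_size, idx):
    """Row indices of the inputs and of the labels in batch number idx."""
    start = idx * batch_size
    stop = min(start + batch_size, n_rows - 1)
    x_idx = list(range(start, stop))
    y_idx = [i + 1 for i in x_idx]
    return x_idx, y_idx


n_rows, batch_size = 10, 4
for idx in range(num_batches(n_rows, batch_size)):
    print(idx, batch_indices(n_rows, batch_size, idx))
```

With 10 rows and a batch size of 4 this gives three batches, the last one holding the single pair (8, 9), so every input row is visited exactly once per pass.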

In that case, the number of steps has to be defined explicitly and a batch_size must NOT be passed:

hist = model.fit(
  tvt["train"],
  validation_data=tvt["val"],
  epochs=epochs,
  steps_per_epoch=len_train // batch_size,
  validation_steps=len_val // batch_size,
  shuffle=True
)

Still, I didn't get from_generator() to work and I would like to know how.
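One approach that may be worth trying, as an untested sketch rather than a confirmed fix: the from_generator() documentation describes args as tensors that are evaluated and handed to the generator as NumPy values, so passing the array itself would presumably materialize it in full. Passing the file path instead and opening the memmap inside the generator avoids that. The names pair_batches and make_dataset are hypothetical, and the call uses the output_types/output_shapes form that matches tf2.2.

```python
import numpy as np


def pair_batches(path, start, stop, batch_size):
    """Yield (x, y) batches where y[i] is the row that follows x[i].

    Assumes stop <= number of rows - 1, so every input has a label.
    """
    # from_generator hands string args to the generator as bytes.
    if isinstance(path, bytes):
        path = path.decode("utf-8")
    start, stop, batch_size = int(start), int(stop), int(batch_size)

    # mmap_mode="r" maps the file instead of reading it into RAM.
    data = np.load(path, mmap_mode="r")
    for lo in range(start, stop, batch_size):
        hi = min(lo + batch_size, stop)
        # np.array copies just these rows out of the mapped file.
        yield np.array(data[lo:hi]), np.array(data[lo + 1:hi + 1])


def make_dataset(path, start, stop, batch_size, sample_shape):
    """Wrap pair_batches in a tf.data.Dataset (hypothetical helper)."""
    import tensorflow as tf  # imported here so pair_batches runs without TF

    batch_shape = tf.TensorShape([None, *sample_shape])
    return tf.data.Dataset.from_generator(
        pair_batches,
        output_types=(tf.float64, tf.float64),
        output_shapes=(batch_shape, batch_shape),
        args=[path, start, stop, batch_size],
    )
```

Under these assumptions the training set would be built as make_dataset("ds.npy", 0, len_train, batch_size, ds_shape[1:]) and passed to model.fit() without a further .batch() call, since the generator already yields whole batches. Any .shuffle() would then shuffle batches rather than single samples, and its buffer size would need to stay small to keep memory use low.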
