
How to use tf.data.Dataset.from_generator() to load only one batch at a time from the dataset?

I want to train a CNN and I am trying to feed the model one batch at a time, directly from a numpy memmap, without having to load the whole dataset into memory, using tf.data.Dataset.from_generator(). I am using tf2.2 and a GPU for fitting. The dataset is a sequence of 3D matrices (NCHW format). The label of each case is the next 3D matrix. The problem is that it still loads the whole dataset into memory.

Here is a short reproducible example:

import numpy as np
from numpy.lib.format import open_memmap
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.config.list_physical_devices("GPU")


# create and initialize the memmap
ds_shape = (20000, 3, 50, 50)
ds_mmap = open_memmap("ds.npy",
                      mode='w+',
                      dtype=np.dtype("float64"),
                      shape=ds_shape)
ds_mmap[:] = np.random.rand(*ds_shape)  # fill the memmap in place

len_ds = len(ds_mmap)          # 20000
len_train = int(0.6 * len_ds)  # 12000
len_val = int(0.2 * len_ds)    # 4000
len_test = int(0.2 * len_ds)   # 4000
batch_size = 32
epochs = 50
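
For scale: the memmap holds 20000 × 3 × 50 × 50 float64 values, i.e. roughly 1.2 GB on disk, which is why I don't want it all in RAM. A quick sanity check of the on-disk size (standard library only, not part of the training code):

import os

# 20000 * 3 * 50 * 50 values * 8 bytes per float64 ≈ 1.2e9 bytes
print(os.path.getsize("ds.npy") / 1e9, "GB")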

I tried 2 ways of generating the train-val-test datasets (also, if anyone could comment on pros and cons, that would be more than welcome):

1.

def gen(ds_mmap, start, stop):
  for i in range(start, stop):
    yield (ds_mmap[i], ds_mmap[i + 1])

tvt = {"train": None, "val": None, "test": None}
tvt_limits = {
  "train": (0, len_train),
  "val": (len_train, len_train + len_val),
  "test": (len_train + len_val, len_ds -1)  # -1 because the last case does not have a label
}

for ds_type in tvt:
  start, stop = tvt_limits[ds_type]
  tvt[ds_type] = tf.data.Dataset.from_generator(
    generator=gen,
    output_types=(tf.float64, tf.float64),
    output_shapes=(ds_shape[1:], ds_shape[1:]),
    args=[ds_mmap, start, stop]
  )

train_ds = (
  tvt["train"]
  .shuffle(len_ds, reshuffle_each_iteration=False)
  .batch(batch_size)
)
val_ds = tvt["val"].batch(batch_size)
test_ds = tvt["test"].batch(batch_size)

2.

def gen(ds_mmap):
  for i in range(len(ds_mmap) - 1):
    yield (ds_mmap[i], ds_mmap[i + 1])

ds = tf.data.Dataset.from_generator(
  generator=gen,
  output_types=(tf.float64, tf.float64),
  output_shapes=(ds_shape[1:], ds_shape[1:]),
  args=[ds_mmap]
)

train_ds = (
  ds
  .take(len_train)
  .shuffle(len_ds, reshuffle_each_iteration=False)
  .batch(batch_size)
)
val_ds = ds.skip(len_train).take(len_val).batch(batch_size)
test_ds = ds.skip(len_train + len_val).take(len_test - 1).batch(batch_size)

Both ways work, but both bring the whole dataset into memory.
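
(To check that, I watched the resident memory of the Python process while pulling a few batches; psutil is not part of the example above, it is only used here for measurement.)

import psutil

proc = psutil.Process()
print("RSS before:", proc.memory_info().rss / 1e9, "GB")
for i, (x, y) in enumerate(train_ds):
  if i == 10:
    break
print("RSS after 10 batches:", proc.memory_info().rss / 1e9, "GB")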

model = keras.Sequential([
  layers.Conv2D(64, (3, 3), input_shape=ds_shape[1:],
                activation="relu", data_format="channels_first"),
  layers.MaxPooling2D(data_format="channels_first"),
  layers.Conv2D(128, (3, 3),
                activation="relu", data_format="channels_first"),
  layers.MaxPooling2D(data_format="channels_first"),
  layers.Flatten(),
  layers.Dense(8182, activation="relu"),
  layers.Dense(np.prod(ds_shape[1:])),
  layers.Reshape(ds_shape[1:])
])

model.compile(loss="mean_absolute_error",
              optimizer="adam",
              metrics=[tf.keras.metrics.MeanSquaredError()])

hist = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs,
  # steps_per_epoch=len_train // batch_size,
  # validation_steps=len_val // batch_size,
  shuffle=True
)

An alternative was to subclass keras.utils.Sequence. The idea is to generate a whole batch at a time.

Quoting the docs:

Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

To do so, the __len__() and __getitem__() methods need to be provided.

For the current example:

class DS(keras.utils.Sequence):
  
  def __init__(self, ds_mmap, start, stop, batch_size):
    self.ds = ds_mmap[start: stop]
    self.batch_size = batch_size

  def __len__(self):
    # divide-ceil
    return -(-len(self.ds) // self.batch_size)

  def __getitem__(self, idx):
    start = idx * self.batch_size
    stop = (idx + 1) * self.batch_size
    batch_y = self.ds[start + 1: stop + 1]
    batch_x = self.ds[start: stop][: len(batch_y)]
    return batch_x, batch_y

for ds_type in tvt:
  start, stop = tvt_limits[ds_type]
  tvt[ds_type] = DS(ds_mmap, start, stop, batch_size)

In that case, the number of steps needs to be defined explicitly, and batch_size must NOT be passed:

hist = model.fit(
  tvt["train"],
  validation_data=tvt["val"],
  epochs=epochs,
  steps_per_epoch=len_train // batch_size,
  validation_steps=len_val // batch_size,
  shuffle=True
)

Still, I didn't get from_generator() to work, and I would like to know how.
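
For reference, one direction I considered but have not verified is to open the memmap by filename inside the generator instead of passing the array itself through args (the filename and split sizes below are the ones from the setup code; whether this actually keeps the data out of RAM is exactly what I am unsure about):

def gen_from_file(path, start, stop):
  # reopen the file as a read-only memmap inside the generator
  # (args passed by from_generator arrive as numpy values, so the path may be bytes)
  mm = np.load(path.decode() if isinstance(path, bytes) else path, mmap_mode="r")
  for i in range(start, stop):
    yield (mm[i], mm[i + 1])

train_ds = tf.data.Dataset.from_generator(
  generator=gen_from_file,
  output_types=(tf.float64, tf.float64),
  output_shapes=(ds_shape[1:], ds_shape[1:]),
  args=["ds.npy", 0, len_train]
).batch(batch_size)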
