

Why does Keras halt at the first epoch when I attempt to train it using fit_generator?

I'm using Keras to fine-tune an existing VGG16 model and am using fit_generator to train the last 4 layers. Here's the relevant code that I'm working with:

# Imports assumed from earlier in the script
from keras import models, layers, optimizers

# Create the model
model = models.Sequential()

# Add the vgg convolutional base model (vgg_conv is the pretrained
# VGG16 convolutional base built earlier, with most of its layers frozen)
model.add(vgg_conv)

# Add new layers
model.add(layers.Flatten())
model.add(layers.Dense(1024, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(5, activation='softmax'))

# Show a summary of the model. Check the number of trainable params
model.summary()
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

validation_datagen = ImageDataGenerator(rescale=1./255)

# Change the batch size according to the system RAM
train_batchsize = 100
val_batchsize = 10

train_dir='training_data/train'
validation_dir='training_data/validation'

# image_size1 and image_size2 (the target input height/width) are assumed
# to be defined earlier; the (15, 20, 512) base output in the summary
# below implies they are 480 and 640 here
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(image_size1, image_size2),
    batch_size=train_batchsize,
    class_mode='categorical')

validation_generator = validation_datagen.flow_from_directory(
    validation_dir,
    target_size=(image_size1, image_size2),
    batch_size=val_batchsize,
    class_mode='categorical',
    shuffle=False)

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])

# Train the model
history = model.fit_generator(
    train_generator,
    steps_per_epoch=train_generator.samples/train_generator.batch_size,
    epochs=30,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples/validation_generator.batch_size,
    verbose=1)

The issue is that when I run my script to train the model, it works fine until the actual training begins. Here, it gets stuck at epoch 1/30.

Layer (type)                 Output Shape              Param #
=================================================================
vgg16 (Model)                (None, 15, 20, 512)       14714688
_________________________________________________________________
flatten_1 (Flatten)          (None, 153600)            0
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              157287424
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024)              0
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 5125
=================================================================
Total params: 172,007,237
Trainable params: 164,371,973
Non-trainable params: 7,635,264
_________________________________________________________________
Found 1989 images belonging to 5 classes.
Found 819 images belonging to 5 classes.
Epoch 1/30

Unfortunately, this is no good. I looked around online and I believe the problem is in using fit_generator; there are reports of the fit_generator code in Keras being buggy. However, most of the other people experiencing issues with the epochs end up getting stuck on later epochs (e.g. somebody wants to run it for 20 epochs and it halts on epoch 19/20).

How would I go about fixing this issue? This is my first time doing deep learning, so I'm incredibly confused and would appreciate any help. Do I just need to move to using model.fit()?

You have to pass valid integers to fit_generator() as the steps_per_epoch and validation_steps parameters (note the floor division // below, which truncates to an integer). So you can use it as follows:

history = model.fit_generator(
    train_generator,
    steps_per_epoch=train_generator.samples // train_generator.batch_size,
    epochs=30,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // validation_generator.batch_size,
    verbose=1)
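
As an aside, on the model.fit() question: in TensorFlow 2.1+, fit_generator is deprecated and model.fit() accepts generators directly, so the same call can be written as below (a sketch assuming a recent TensorFlow/Keras version):

history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // train_generator.batch_size,
    epochs=30,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // validation_generator.batch_size,
    verbose=1)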

The second factor I can see is that your model has 165M trainable parameters, which has huge memory consumption, particularly coupled with a high batch size. You should use images with lower resolution; note that in many cases we can get better results with them.
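
Nearly all of those parameters sit in the first Dense layer after Flatten, so the input resolution dominates the count. A quick back-of-the-envelope sketch (the 480x640 input is implied by the (15, 20, 512) base output in the summary above; 224x224 is shown for comparison):

# VGG16 downsamples spatial dimensions by a factor of 32 (five max-pools)
def dense_params(h, w, units=1024, channels=512):
    flat = (h // 32) * (w // 32) * channels  # Flatten() output size
    return flat * units + units              # Dense weights + biases

print(dense_params(480, 640))  # 157287424 -- matches dense_1 in the summary
print(dense_params(224, 224))  # 25691136  -- roughly 6x fewer parameters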

I had the same problem and solved it after setting validation_steps=validation_size//batch_size.

I had the same issue on Colab. I looked at the runtime logs and they said, e.g.: "Filling up shuffle buffer (this may take a while): 1195 of 5000".

So for me it was just because the shuffle buffer size was too big and it took ages to load the data into memory.
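
That "Filling up shuffle buffer" message comes from a tf.data input pipeline rather than from ImageDataGenerator; if that is what you are using, a smaller shuffle buffer avoids the long stall at the start of the first epoch. A minimal sketch, assuming a tf.data.Dataset (the dataset contents and buffer size are illustrative):

import tensorflow as tf

# Stand-in dataset; replace with the Dataset built from your training files
dataset = tf.data.Dataset.from_tensor_slices(list(range(5000)))

# shuffle(5000) would load 5000 elements into memory before training starts;
# a smaller buffer trades shuffling quality for a much faster start-up
dataset = dataset.shuffle(buffer_size=1000).batch(32)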
