Keras fit_generator() is not training correctly

Question

I am trying to create an image classifier using Keras with the TensorFlow 2.0.0 backend.

I'm training this model on my local machine on a custom dataset containing roughly 17 thousand images. The images vary in size and are located in three different folders (training, validation, and test), each containing two subfolders (one for each class). I tried an architecture similar to VGG16, which yielded more than decent results on this dataset in the past. Note that there is a minor class imbalance in the data (52:48).

When I call fit_generator(), the model doesn't train well; although the training loss lowers slightly throughout the first epoch, it does not change much afterward. Using this architecture with stronger regularization, I achieved 85% accuracy after roughly 55 epochs in the past.

Imports and hyperparameters

import tensorflow as tf
from tensorflow import keras
from keras import backend as k
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten, Input, UpSampling2D
from keras.models import Sequential, Model, load_model
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint

TRAIN_PATH = 'data/train/'
VALID_PATH = 'data/validation/'
TEST_PATH = 'data/test/'
TARGET_SIZE = (256, 256)
RESCALE = 1.0 / 255
COLOR_MODE = 'grayscale'
EPOCHS = 2
BATCH_SIZE = 16
CLASSES = ['Damselflies', 'Dragonflies']
CLASS_MODE = 'categorical'
CHECKPOINT = "checkpoints/weights.hdf5"

Model

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(256, 256, 1), padding='same'))

model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Flatten())
model.add(Dense(516, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='Adam', metrics=['accuracy'])

In the past, I created a custom pipeline to reshape, grayscale, flip, and normalize the images; then, I trained the model using my CPU on batches of processed images.

I tried repeating the process using ImageDataGenerator, flow_from_directory, and GPU support.

# randomly flip images, and scale pixel values
trainGenerator = ImageDataGenerator(rescale=RESCALE, 
                                    horizontal_flip=True,  
                                    vertical_flip=True)

# only scale the pixel values of the validation images
validatioinGenerator = ImageDataGenerator(rescale=RESCALE)

# only scale the pixel values of the test images
testGenerator = ImageDataGenerator(rescale=RESCALE)

# instantiate train flow
trainFlow = trainGenerator.flow_from_directory(
    TRAIN_PATH,
    target_size = TARGET_SIZE,
    batch_size = BATCH_SIZE,
    classes = CLASSES,
    color_mode = COLOR_MODE,
    class_mode = CLASS_MODE,
    shuffle=True
) 

# instantiate validation flow
validationFlow = validatioinGenerator.flow_from_directory(
    VALID_PATH,
    target_size = TARGET_SIZE,
    batch_size = BATCH_SIZE,
    classes = CLASSES,
    color_mode = COLOR_MODE,
    class_mode= CLASS_MODE,
    shuffle=True
)

Then, I fit the model using fit_generator.

checkpoints = ModelCheckpoint(CHECKPOINT, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

with tf.device('/GPU:0'):
    model.fit_generator(
        trainFlow,
        validation_data=validationFlow, 
        callbacks=[checkpoints],
        epochs=EPOCHS
    )

I tried training it for 40 epochs. The classifier achieves 52% accuracy after the first epoch and does not improve as time goes by.

Testing the classifier

testFlow = testGenerator.flow_from_directory(
    TEST_PATH,
    target_size = TARGET_SIZE,
    batch_size = BATCH_SIZE,
    classes = CLASSES,
    color_mode = COLOR_MODE,
    class_mode= CLASS_MODE,
)

ans = model.predict_generator(testFlow)

When I look at the predictions, the model predicts all the test images as the majority class with the same confidence, [0.48498476, 0.51501524].
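One way to read that constant output, as a rough sketch rather than a diagnosis: a softmax that returns the same vector for every image has usually stopped using its input, and the cross-entropy of such a predictor sits just under ln 2 (about 0.693) for two classes. The snippet below assumes the 52:48 split from the question applies, with the majority class at index 1; the helper names is_collapsed and constant_predictor_loss are made up for illustration.

```python
import math


def is_collapsed(predictions, tol=1e-6):
    """Return True when every prediction row equals the first one within tol."""
    first = predictions[0]
    return all(
        abs(p - q) <= tol
        for row in predictions
        for p, q in zip(row, first)
    )


def constant_predictor_loss(class_fractions, predicted_probs):
    """Mean categorical cross-entropy when the same probabilities are output for every sample."""
    return -sum(
        fraction * math.log(prob)
        for fraction, prob in zip(class_fractions, predicted_probs)
    )


# Values taken from the question; the class order is an assumption.
preds = [[0.48498476, 0.51501524]] * 4
loss = constant_predictor_loss([0.48, 0.52], preds[0])
print(is_collapsed(preds), round(loss, 4), round(math.log(2), 4))
```

If the training loss reported by Keras hovers around the same value, the network has probably learned only the class prior and nothing about the images.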

Have I made sure the data is correct?

Yes. I tested whether the generators yield processed images and their corresponding labels correctly.

Have I tried changing the loss function, activation function, and optimizer?

Yes. I tried changing the class mode to binary, the loss to binary_crossentropy, and changing the last layer to produce a single output with sigmoid activation. No, I did not change the optimizer. However, I did try to increase the learning rate.

Have I tried changing the model's architecture?

Yes. I tried increasing and decreasing model complexity. Both more layers with less regularization and fewer layers with more regularization produced similar results.

Are the layers trainable?

Yes.

Is the GPU support implemented correctly?

I hope so.

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available: 1

a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a') 
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b') 
c = tf.matmul(a, b)

config = tf.compat.v1.ConfigProto(log_device_placement=True) 
config.gpu_options.allow_growth = True 
sess = tf.compat.v1.Session(config=config)
print(sess)

Device mapping: /job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce GTX 1050 with Max-Q Design, pci bus id: 0000:03:00.0, compute capability: 6.1

<tensorflow.python.client.session.Session object at 0x000001F9443E2CC0>

Have I tried transfer learning?

Not yet.

I found a similar unanswered question from 2017: keras-doesnt-train-using-fit-generator.

Thoughts?

Answer 1

The problem is with your model. I copied your code and ran it on a data set I have used before (which gets high accuracy) and got results similar to yours. I then substituted the simple model below:

model = tf.keras.Sequential([
    Conv2D(16, 3, padding='same', activation='relu', input_shape=(256 , 256,1)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu' ),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(.3),
    Dense(64, activation='relu'),
    Dropout(.3),
    Dense(2, activation='softmax')
])
model.compile(loss='categorical_crossentropy',
              optimizer='Adam', metrics=['accuracy'])

The model trained properly. By the way, model.fit_generator is deprecated. You can now just use model.fit, which can handle generators. I then took your model and removed all the dropout layers except for the last one, and your model trained properly. The code is:

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(256, 256, 1), padding='same'))

model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.1))

model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.1))

model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Dropout(0.1))

model.add(Flatten())
model.add(Dense(516, activation='relu'))
#model.add(Dropout(0.1))

model.add(Dense(128, activation='relu'))
model.add(Dropout(0.1))

model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='Adam', metrics=['accuracy'])
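On the deprecation point made earlier in this answer, here is a minimal sketch of the same training call routed through model.fit. It assumes a TensorFlow release in which Model.fit accepts generators for both the training and the validation data (fit_generator was formally deprecated in TensorFlow 2.1); the wrapper name train is hypothetical, and the arguments stand in for the question's model, trainFlow, validationFlow, and checkpoints objects.

```python
def train(model, train_flow, validation_flow, epochs, callbacks=None):
    """Train with Model.fit instead of the deprecated Model.fit_generator.

    `model` is expected to be a compiled Keras model, and the two flows are
    expected to be iterators such as those returned by flow_from_directory.
    """
    return model.fit(
        train_flow,                       # batches of (images, labels)
        validation_data=validation_flow,  # evaluated at the end of each epoch
        epochs=epochs,
        callbacks=list(callbacks or []),  # e.g. a ModelCheckpoint instance
    )


# Hypothetical usage with the objects defined in the question:
# history = train(model, trainFlow, validationFlow, EPOCHS, [checkpoints])
```

The explicit tf.device('/GPU:0') block from the question is left out here because TensorFlow places operations on an available GPU by default.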

Answer 2

@Gerry P,

By accident, I found what's causing the error. Removing the line from keras import backend as k resolved the model's inability to learn.

That's not all. I also found that the model you defined, not calling ModelCheckpoint, and not customizing class names all affected the fitting process.

model = Sequential([
    Conv2D(16, 3, padding='same', activation='relu', input_shape=(256 , 256, 1)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu' ),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(128, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(.3),
    Dense(64, activation='relu'),
    Dropout(.3),
    Dense(2, activation='softmax')
])

I commented out that import to try to resolve an error that occurred when I copy-pasted your sequential model. Then, I forgot to uncomment it when I tested it on the "beautiful or average" dataset. I achieved over 80% accuracy after the third epoch. Then, I reverted the changes and tried it on my dataset, and it failed again. As a bonus, not importing Keras's backend decreased the time it takes to train the model!

Lately, I had to reinstall Keras and TensorFlow because they couldn't detect my GPU anymore. I probably made a mistake and installed an incompatible version of Keras.

CUDA==10.0
tensorflow-gpu==2.0.0
keras==2.3.1
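Since this answer traces the failure to an import, it may help to check a script for mixed Keras imports before training. The question's import block pulls in both tensorflow.keras (via from tensorflow import keras) and the standalone keras package. The sketch below uses only the standard-library ast module; the function name keras_import_roots is hypothetical, and whether mixing the two packages is the actual root cause here is an assumption, not something this thread establishes.

```python
import ast


def keras_import_roots(source):
    """Return which Keras packages ('keras', 'tensorflow.keras') a script imports."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and imported name so that
            # "from tensorflow import keras" becomes "tensorflow.keras".
            names = [node.module + "." + alias.name for alias in node.names]
        else:
            continue
        for name in names:
            if name == "keras" or name.startswith("keras."):
                roots.add("keras")
            elif name == "tensorflow.keras" or name.startswith("tensorflow.keras."):
                roots.add("tensorflow.keras")
    return roots


# Three of the import lines from the question.
question_imports = (
    "import tensorflow as tf\n"
    "from tensorflow import keras\n"
    "from keras import backend as k\n"
)
print(sorted(keras_import_roots(question_imports)))
```

If the result contains both entries, one option is to import everything from tensorflow.keras (for example from tensorflow.keras.layers import Dense) so that layers, models, and callbacks all come from the same package.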

Note that it's still not a 100% solution, and the problems arise every so often.

EDIT:

Whenever it doesn't work, simplify the model. Changed the batch size and it stopped learning? Simplify the model. Augmented the images further and it stopped learning? Simplify the model.
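To put a number on "simplify the model", the trainable parameter counts of the two architectures in this thread can be compared by hand. The sketch below assumes 3x3 kernels, 'same' padding, and one 2x2 max-pooling layer after each block of convolutions, as in the code above; count_params and its helpers are illustrative names, not Keras functions (model.summary() on a built model should report the same totals).

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus biases of a Dense layer."""
    return in_units * out_units + out_units


def count_params(conv_blocks, dense_units, side=256, channels=1, kernel=3):
    """Count parameters of a conv/pool/dense stack.

    conv_blocks is a list of blocks; each block lists the filter counts of
    its 'same'-padded convolutions and is followed by one 2x2 max-pool.
    """
    total = 0
    for block in conv_blocks:
        for filters in block:
            total += conv_params(kernel, channels, filters)
            channels = filters
        side //= 2  # 2x2 pooling halves height and width
    units = side * side * channels  # size of the flattened feature map
    for out_units in dense_units:
        total += dense_params(units, out_units)
        units = out_units
    return total


original = count_params([[32, 32], [64, 64], [128, 128]], [516, 128, 2])
simplified = count_params([[16], [32], [64], [128], [256]], [128, 64, 2])
print(original, simplified)  # 67986534 2497986
```

Almost all of the difference comes from the first dense layer: flattening a 32x32x128 feature map into Dense(516) costs about 67.6 million weights, while flattening 8x8x256 into Dense(128) costs about 2.1 million.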


Solution 1
1 2021-05-07 17:00:48

Solution 2
0 2021-05-07 23:51:10
