Why is my ResNet50 model in Keras not converging?

Question

I am currently trying to classify images of integrated circuits as defective or non-defective. I already tried VGG16 and InceptionV3 and got really good results for both (95% validation accuracy and low validation loss). Now I wanted to try ResNet50, but my model is not converging. Its training accuracy is at 95% too, but the validation loss keeps increasing while the validation accuracy gets stuck at 50%.

This is my script so far:

from keras.applications.resnet50 import ResNet50
from keras.optimizers import Adam
from keras.preprocessing import image
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout
from keras import backend as K
from keras_preprocessing.image import ImageDataGenerator
import tensorflow as tf

class ResNet:
    def __init__(self):
        self.img_width, self.img_height = 224, 224  # Dimensions of cropped image
        self.classes_num = 2  # Number of classifications

        # Training configurations
        self.epochs = 32
        self.batch_size = 16  # Play with this to determine number of images to train on per epoch
        self.lr = 0.0001

    def build_model(self, train_path):
        train_data_path = train_path
        train_datagen = ImageDataGenerator(rescale=1. / 255, validation_split=0.25)

        train_generator = train_datagen.flow_from_directory(
            train_data_path,
            target_size=(self.img_height, self.img_width),
            color_mode="rgb",
            batch_size=self.batch_size,
            class_mode='categorical',
            subset='training')

        validation_generator = train_datagen.flow_from_directory(
            train_data_path,
            target_size=(self.img_height, self.img_width),
            color_mode="rgb",
            batch_size=self.batch_size,
            class_mode='categorical',
            subset='validation')

        # create the base pre-trained model
        base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(self.img_height, self.img_width, 3))

        # add a global spatial average pooling layer
        x = base_model.output
        x = GlobalAveragePooling2D()(x)
        # let's add a fully-connected layer
        x = Dense(1024, activation='relu')(x)
        #x = Dropout(0.3)(x)
        # and a logistic layer -- let's say we have 200 classes
        predictions = Dense(2, activation='softmax')(x)

        # this is the model we will train
        model = Model(inputs=base_model.input, outputs=predictions)

        # first: train only the top layers (which were randomly initialized)
        # i.e. freeze all convolutional InceptionV3 layers
        for layer in base_model.layers:
            layer.trainable = True

        # compile the model (should be done *after* setting layers to non-trainable)
        opt = Adam(self.lr)  # , decay=self.INIT_LR / self.NUM_EPOCHS)
        model.compile(opt, loss='binary_crossentropy', metrics=["accuracy"])

        # train the model on the new data for a few epochs
        from keras.callbacks import ModelCheckpoint, EarlyStopping
        import matplotlib.pyplot as plt

        checkpoint = ModelCheckpoint('resnetModel.h5', monitor='val_accuracy', verbose=1, save_best_only=True,
                                 save_weights_only=False, mode='auto', period=1)

        early = EarlyStopping(monitor='val_accuracy', min_delta=0, patience=16, verbose=1, mode='auto')
        hist = model.fit_generator(steps_per_epoch=self.batch_size, generator=train_generator,
                               validation_data=validation_generator, validation_steps=self.batch_size, epochs=self.epochs,
                               callbacks=[checkpoint, early])

        plt.plot(hist.history['accuracy'])
        plt.plot(hist.history['val_accuracy'])
        plt.plot(hist.history['loss'])
        plt.plot(hist.history['val_loss'])
        plt.title("model accuracy")
        plt.ylabel("Accuracy")
        plt.xlabel("Epoch")
        plt.legend(["Accuracy", "Validation Accuracy", "loss", "Validation Loss"])
        plt.show()

        plt.figure(1)

import tensorflow as tf

if __name__ == '__main__':
    x = ResNet()
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.compat.v1.Session(config=config)
    x.build_model("C:/Users/but/Desktop/dataScratch/Train")

And this is the training of the model

What could be the reason for ResNet to fail while VGG and Inception work? Do I have any mistakes in my script?

Answer 1

At least in the code, I don't see any mistakes that might affect the training process.

# and a logistic layer -- let's say we have 200 classes
predictions = Dense(2, activation='softmax')(x)

Those lines are a bit suspicious. But it seems that the typo is only in the comment, so it should be okay.

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = True

These are suspicious too. If you want to freeze the ResNet-50 layers, what you need to do is:

...
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(self.img_height, self.img_width, 3))
for layer in base_model.layers:
    layer.trainable = False
...

But it turned out that layer.trainable = True was actually your intention, so it wouldn't matter either.
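To confirm which of the two settings is actually in effect, one option is to count the trainable flags before compiling. The following is only a minimal sketch: `summarize_trainable` is a hypothetical helper, and the stand-in layer objects exist only so the snippet runs without Keras installed.

```python
from types import SimpleNamespace


def summarize_trainable(layers):
    """Return (trainable, frozen) counts for objects exposing a boolean `.trainable`."""
    trainable = sum(1 for layer in layers if layer.trainable)
    return trainable, len(layers) - trainable


# Stand-in layers (an assumption for illustration): the first three are frozen,
# the last two are trainable. With a real model, pass base_model.layers instead.
fake_layers = [
    SimpleNamespace(name=f"layer_{i}", trainable=(i >= 3)) for i in range(5)
]

print(summarize_trainable(fake_layers))  # (2, 3)
```

With the script from the question, calling `summarize_trainable(base_model.layers)` right after the loop should report zero frozen layers, since every layer is set to `trainable = True`.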

First of all, if you are using the same code that you used for training VGG16 and InceptionV3, it is unlikely that the code is the problem.

Why don't you check the following possible causes?

The model may be too small or too big, so that it underfits or overfits (number of parameters).
The model may need more time to converge (training for more epochs).
ResNet may not be suited to this classification task.
The pretrained weights that you used may not be suited to this classification task.
The learning rate may be too small or too big.
etc.
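On the "more time to converge" point, it may help to work out how much data each epoch actually covers. In the question's script, `steps_per_epoch` and `validation_steps` are both set to `self.batch_size`, and in Keras these arguments count batches, not images. The sketch below does the arithmetic; the helper names and the dataset size of 3000 are assumptions, since the question does not state how many images there are.

```python
import math


def images_per_epoch(steps_per_epoch, batch_size):
    """Upper bound on images drawn from the generator in one epoch."""
    return steps_per_epoch * batch_size


def full_pass_steps(num_samples, batch_size):
    """Steps needed for one epoch to cover every sample once."""
    return math.ceil(num_samples / batch_size)


batch_size = 16               # from the question's script
epochs = 32                   # from the question's script
steps_per_epoch = batch_size  # the script passes steps_per_epoch=self.batch_size

num_train_samples = 3000      # hypothetical; not stated in the question

seen = images_per_epoch(steps_per_epoch, batch_size)
print(seen)                                            # 256
print(seen * epochs)                                   # 8192
print(full_pass_steps(num_train_samples, batch_size))  # 188
```

If the training set is much larger than 256 images, each "epoch" in the script covers only a fraction of it, so 32 epochs may amount to far fewer full passes over the data than the number suggests.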
