
RNN and CNN-RNN won't train correctly, always predict one class

I am currently developing a model to detect emotion from text using deep learning algorithms. I have a relatively small labelled dataset (~7500 samples) with 7 different emotions as classes. I developed a CNN and achieved an accuracy of ~63%, but when I tried to apply an RNN (using LSTM) and a CNN-RNN (also using LSTM), they just don't seem to train properly at all and always end up predicting the same class. I believe my models to be fundamentally sound, but with some mistakes in the parameters. I have the dataset split into 85% for training, with a further 20% of that used for validation, and the remaining 15% for testing. My embedding matrix is built from the word representations in the Google News word2vec model, and the word index is built with the Keras Tokenizer.
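For reference, a minimal sketch of that preprocessing is below. Variable names such as texts and the word2vec file path are assumptions for illustration, not the code actually used in the question.

    # Sketch of the preprocessing described above (assumed variable names).
    import numpy as np
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from gensim.models import KeyedVectors

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)                      # texts: list of raw sentences
    sequences = tokenizer.texts_to_sequences(texts)
    max_len   = max(len(s) for s in sequences)         # 179 in the question
    x_data    = pad_sequences(sequences, maxlen=max_len)

    # Google News vectors are 300-dimensional; words missing from word2vec stay all-zero.
    w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    output_dim = 300
    input_dim  = len(tokenizer.word_index) + 1         # +1 for the padding index 0
    embedding_matrix = np.zeros((input_dim, output_dim))
    for word, i in tokenizer.word_index.items():
        if word in w2v:
            embedding_matrix[i] = w2v[word]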

Dataset breakdown:

Emotion   Count
anger     1086
disgust   1074
fear      1086
guilt     1062
joy       1089
sadness   1080
shame     1058

CNN implementation

from keras.models import Sequential, Model
from keras.layers import (Embedding, Conv1D, GlobalMaxPooling1D, MaxPooling1D,
                          LSTM, Concatenate, Dropout, Dense, Activation)

def make_model(kernel_sizes, num_filters, dropout, hidden_units):

    submodels = []
    for kernel_size in kernel_sizes:
        submodel = Sequential()

        submodel.add(Embedding(input_dim = input_dim,
                            output_dim   = output_dim,
                            weights      = [embedding_matrix],
                            input_length = max_len,
                            trainable    = True))

        submodel.add(Conv1D(filters=num_filters, kernel_size=kernel_size, padding='same',activation='relu',strides=1))
        submodel.add(GlobalMaxPooling1D())
        submodels.append(submodel)

    submodel_outputs = [model.output for model in submodels]    
    submodel_inputs = [model.input for model in submodels]

    merged = Concatenate(axis=1)(submodel_outputs)
    x = Dropout(dropout)(merged)

    if(hidden_units > 0):
        x = Dense(hidden_units, activation='relu')(x)
        x = Dropout(dropout)(x)

    x = Dense(7,activation='softmax', kernel_initializer="uniform")(x)
    out = Activation('sigmoid')(x)

    model = Model(submodel_inputs, out)
    model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

    return model
def fit_model(model, kernel_sizes, num_epochs, batch_size, x_train, y_train):

    x_train = [x_train]*len(kernel_sizes)

    history = model.fit(x_train, y_train, batch_size=batch_size, epochs=num_epochs, validation_split=0.2)

    return history
kernel_sizes  = [2,6]
num_filters   = 100
dropout       = 0.6
num_hidden    = 270
callbacks     = callbacks_list
num_epochs    = 15
batch_size = 64
model = make_model(kernel_sizes, num_filters, dropout, num_hidden)
print(model.summary())
history = fit_model(model, kernel_sizes, num_epochs, batch_size, x_train, y_train)

Model: "model_1"型号:“model_1”


Layer (type) Output Shape Param # Connected to层(类型)输出形状参数#连接到

embedding_1_input (InputLayer) (None, 179) 0 embedding_1_input (InputLayer) (None, 179) 0


embedding_2_input (InputLayer) (None, 179) 0 embedding_2_input (InputLayer) (None, 179) 0


embedding_1 (Embedding) (None, 179, 300) 2729400 embedding_1_input[0][0] embedding_1 (Embedding) (None, 179, 300) 2729400 embedding_1_input[0][0]


embedding_2 (Embedding) (None, 179, 300) 2729400 embedding_2_input[0][0] embedding_2 (Embedding) (None, 179, 300) 2729400 embedding_2_input[0][0]


conv1d_1 (Conv1D) (None, 179, 100) 60100 embedding_1[0][0] conv1d_1 (Conv1D) (None, 179, 100) 60100 embedding_1[0][0]


conv1d_2 (Conv1D) (None, 179, 100) 180100 embedding_2[0][0] conv1d_2 (Conv1D) (None, 179, 100) 180100 embedding_2[0][0]


global_max_pooling1d_1 (GlobalM (None, 100) 0 conv1d_1[0][0] global_max_pooling1d_1 (GlobalM (None, 100) 0 conv1d_1[0][0]


global_max_pooling1d_2 (GlobalM (None, 100) 0 conv1d_2[0][0] global_max_pooling1d_2 (GlobalM (None, 100) 0 conv1d_2[0][0]


concatenate_1 (Concatenate) (None, 200) 0 global_max_pooling1d_1[0][0] concatenate_1 (Concatenate) (None, 200) 0 global_max_pooling1d_1[0][0]
global_max_pooling1d_2[0][0] global_max_pooling1d_2[0][0]


dropout_1 (Dropout) (None, 200) 0 concatenate_1[0][0] dropout_1 (Dropout) (None, 200) 0 concatenate_1[0][0]


dense_1 (Dense) (None, 270) 54270 dropout_1[0][0]密集_1(密集)(无,270)54270 dropout_1[0][0]


dropout_2 (Dropout) (None, 270) 0 dense_1[0][0] dropout_2(辍学)(无,270)0密集_1[0][0]


dense_2 (Dense) (None, 7) 1897 dropout_2[0][0]密集_2(密集)(无,7)1897 dropout_2[0][0]


activation_1 (Activation) (None, 7) 0 dense_2[0][0]激活_1(激活)(无,7)0密集_2[0][0]

Total params: 5,755,167 Trainable params: 5,755,167 Non-trainable params: 0总参数:5,755,167 可训练参数:5,755,167 不可训练参数:0


Training and validation results for CNN (figure)

CNN confusion matrix (figure)


RNN implementation

def make_model(lstm_units, dropout, hidden_units):

    model = Sequential()   

    model.add(Embedding(input_dim = input_dim,
                        output_dim   = output_dim,
                        weights      = [embedding_matrix],
                        input_length = max_len,
                        trainable    = False))

    model.add(LSTM(lstm_units))

    model.add(Dropout(dropout))

    if(hidden_units > 0):
        model.add(Dense(hidden_units, activation='elu'))
        model.add(Dropout(dropout))

    model.add(Dense(7,activation='softmax', kernel_initializer="uniform"))
    model.add(Activation('sigmoid'))

    model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

    return model
lstm_units = 120
dropout = 0.5
hidden_units = 550
callbacks = [tensorboard, early]
num_epochs = 20
batch_size = 60

model = make_model(lstm_units, dropout, hidden_units)
print(model.summary())
history = fit_model(model, num_epochs, batch_size, x_train, y_train)

Model: "sequential_6"型号:“sequential_6”


Layer (type) Output Shape Param #层(类型)输出形状参数#

embedding_6 (Embedding) (None, 179, 300) 2729400 embedding_6(嵌入)(无、179、300)2729400


lstm_8 (LSTM) (None, 120) 202080 lstm_8 (LSTM)(无,120)202080


dropout_5 (Dropout) (None, 120) 0 dropout_5(辍学)(无,120)0


dense_6 (Dense) (None, 550) 66550密集_6(密集)(无,550)66550


dropout_6 (Dropout) (None, 550) 0 dropout_6(辍学)(无,550)0


dense_7 (Dense) (None, 7) 3857密集_7(密集)(无,7)3857


activation_3 (Activation) (None, 7) 0 activation_3(激活)(无,7)0

Total params: 3,001,887 Trainable params: 272,487 Non-trainable params: 2,729,400总参数:3,001,887 可训练参数:272,487 不可训练参数:2,729,400


RNN training and validation scores (figure)

RNN confusion matrix (figure)


CNN-RNN implementation

def make_model(kernel_sizes, num_filters, dropout, hidden_units, lstm_units):

    submodels = []
    for kernel_size in kernel_sizes:
        submodel = Sequential()

        submodel.add(Embedding(input_dim = input_dim,
                            output_dim   = output_dim,
                            weights      = [embedding_matrix],
                            input_length = max_len,
                            trainable    = True))

        submodel.add(Conv1D(filters=num_filters, kernel_size=kernel_size, padding='same',activation='relu',strides=1))
        submodel.add(MaxPooling1D(pool_size=2, strides = 2))
        submodel.add(Dropout(dropout))
        submodel.add(LSTM(lstm_units)) 
        submodels.append(submodel)

    submodel_outputs = [model.output for model in submodels]    
    submodel_inputs = [model.input for model in submodels]

    merged = Concatenate(axis=1)(submodel_outputs)
    x = Dropout(dropout)(merged)

    if(hidden_units > 0):
        x = Dense(hidden_units, activation='relu')(x)
        x = Dropout(dropout)(x)

    x = Dense(7,activation='softmax', kernel_initializer="uniform")(x)
    out = Activation('sigmoid')(x)

    model = Model(submodel_inputs, out)
    model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

    return model
kernel_sizes  = [2,3,6]
num_filters   = 100
dropout       = 0.6
num_hidden    = 270
lstm_units = 80
callbacks     = [tensorboard, early]
num_epochs    = 20
batch_size = 64

model = make_model(kernel_sizes, num_filters, dropout, num_hidden, lstm_units)
print(model.summary())
history = fit_model(model, kernel_sizes, num_epochs, batch_size, x_train, y_train)

Model: "model_2"型号:“model_2”


Layer (type) Output Shape Param # Connected to层(类型)输出形状参数#连接到

embedding_8_input (InputLayer) (None, 179) 0 embedding_8_input (InputLayer) (None, 179) 0


embedding_9_input (InputLayer) (None, 179) 0 embedding_9_input (InputLayer) (None, 179) 0


embedding_10_input (InputLayer) (None, 179) 0 embedding_10_input (InputLayer) (None, 179) 0


embedding_8 (Embedding) (None, 179, 300) 2729400 embedding_8_input[0][0] embedding_8 (Embedding) (None, 179, 300) 2729400 embedding_8_input[0][0]


embedding_9 (Embedding) (None, 179, 300) 2729400 embedding_9_input[0][0] embedding_9 (Embedding) (None, 179, 300) 2729400 embedding_9_input[0][0]


embedding_10 (Embedding) (None, 179, 300) 2729400 embedding_10_input[0][0] embedding_10 (Embedding) (None, 179, 300) 2729400 embedding_10_input[0][0]


conv1d_8 (Conv1D) (None, 179, 100) 60100 embedding_8[0][0] conv1d_8 (Conv1D) (None, 179, 100) 60100 embedding_8[0][0]


conv1d_9 (Conv1D) (None, 179, 100) 90100 embedding_9[0][0] conv1d_9 (Conv1D) (None, 179, 100) 90100 embedding_9[0][0]


conv1d_10 (Conv1D) (None, 179, 100) 180100 embedding_10[0][0] conv1d_10 (Conv1D) (None, 179, 100) 180100 embedding_10[0][0]


max_pooling1d_7 (MaxPooling1D) (None, 89, 100) 0 conv1d_8[0][0] max_pooling1d_7 (MaxPooling1D) (None, 89, 100) 0 conv1d_8[0][0]


max_pooling1d_8 (MaxPooling1D) (None, 89, 100) 0 conv1d_9[0][0] max_pooling1d_8 (MaxPooling1D) (None, 89, 100) 0 conv1d_9[0][0]


max_pooling1d_9 (MaxPooling1D) (None, 89, 100) 0 conv1d_10[0][0] max_pooling1d_9 (MaxPooling1D) (None, 89, 100) 0 conv1d_10[0][0]


dropout_9 (Dropout) (None, 89, 100) 0 max_pooling1d_7[0][0] dropout_9 (Dropout) (None, 89, 100) 0 max_pooling1d_7[0][0]


dropout_10 (Dropout) (None, 89, 100) 0 max_pooling1d_8[0][0] dropout_10 (Dropout) (None, 89, 100) 0 max_pooling1d_8[0][0]


dropout_11 (Dropout) (None, 89, 100) 0 max_pooling1d_9[0][0] dropout_11 (Dropout) (None, 89, 100) 0 max_pooling1d_9[0][0]


lstm_2 (LSTM) (None, 80) 57920 dropout_9[0][0] lstm_2 (LSTM) (None, 80) 57920 dropout_9[0][0]


lstm_3 (LSTM) (None, 80) 57920 dropout_10[0][0] lstm_3 (LSTM) (None, 80) 57920 dropout_10[0][0]


lstm_4 (LSTM) (None, 80) 57920 dropout_11[0][0] lstm_4 (LSTM) (None, 80) 57920 dropout_11[0][0]


concatenate_3 (Concatenate) (None, 240) 0 lstm_2[0][0] concatenate_3 (Concatenate) (None, 240) 0 lstm_2[0][0]
lstm_3[0][0] lstm_3[0][0]
lstm_4[0][0] lstm_4[0][0]


dropout_12 (Dropout) (None, 240) 0 concatenate_3[0][0] dropout_12 (Dropout) (None, 240) 0 concatenate_3[0][0]


dense_3 (Dense) (None, 270) 65070 dropout_12[0][0]密集_3(密集)(无,270)65070 dropout_12[0][0]


dropout_13 (Dropout) (None, 270) 0 dense_3[0][0] dropout_13(辍学)(无,270)0密集_3[0][0]


dense_4 (Dense) (None, 7) 1897 dropout_13[0][0]密集_4(密集)(无,7)1897 dropout_13[0][0]


activation_2 (Activation) (None, 7) 0 dense_4[0][0]激活_2(激活)(无,7)0dense_4[0][0]

Total params: 8,759,227 Trainable params: 8,759,227 Non-trainable params: 0总参数:8,759,227 可训练参数:8,759,227 不可训练参数:0


CNN-RNN training and validation scores (figure)

CNN-RNN confusion matrix (figure)

I understand there is no magic formula for neural networks and no one-size-fits-all approach; I am just looking for some guidance on the areas where I may have made mistakes when implementing the CNN-RNN and RNN.

Apologies in advance for any formatting errors, as this is my first question. If there is any other info required, please let me know.

Thanks very much.

I can't say this will solve all your issues, but something that is definitely wrong is your repeated use of a sigmoid activation right after a softmax activation, while your classification problem has 7 classes. A sigmoid activation can only separate two classes.

For instance:

model.add(Dense(7,activation='softmax', kernel_initializer="uniform"))
model.add(Activation('sigmoid'))

You should just remove the sigmoid activation in the three places where you did this.
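With the extra layer removed, the output block (shown here for the Sequential model) simply becomes:

    model.add(Dense(7, activation='softmax', kernel_initializer="uniform"))
    # no Activation('sigmoid') after this; softmax already outputs the 7 class probabilities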

First of all, your CNN implementation is over-enthusiastic. Did you arrive at this architecture by experimenting with multiple designs, or did you just choose it?

Usually, when multiple heads are used, each one is fed a slight variation of the input rather than the exact same copy, so your multi-head design may not be the optimal choice: it introduces too many unnecessary parameters and can lead to overfitting, which is evident from your loss curve.

https://i.stack.imgur.com/v1GeS.png

You used categorical crossentropy but applied a sigmoid after the softmax, which is also not how things are done. Just use the softmax activation and get rid of the sigmoid.

Is the confusion matrix for the test set? If so, your test split seems too easy: the model is so heavily overfitted that it should perform poorly on it. So try to find a better test split by making sure that not too much similar data falls into both training and testing.
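For instance, a stratified split is a simple way to keep every emotion represented proportionally in both splits. This is only a sketch; x_data and class_ids are assumed names for the padded sequences and the integer emotion labels (0..6).

    from sklearn.model_selection import train_test_split
    from keras.utils import to_categorical

    # Stratify on the class label so the 15% test set mirrors the class balance.
    x_train, x_test, y_train_ids, y_test_ids = train_test_split(
        x_data, class_ids, test_size=0.15, stratify=class_ids, random_state=42)

    y_train = to_categorical(y_train_ids, num_classes=7)
    y_test  = to_categorical(y_test_ids, num_classes=7)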

It's always better to fine-tune your simpler model before moving on to complicated ones. Since your LSTM model didn't perform well, it doesn't make sense to try an even more complicated model (CNN-LSTM). Your LSTM model didn't converge; the reasons can be many (the obvious one being the incorrect usage of the activation layer).

def make_model(lstm_units, dropout, hidden_units):

    model = Sequential()   

    model.add(Embedding(input_dim = input_dim,
                        output_dim   = output_dim,
                        weights      = [embedding_matrix],
                        input_length = max_len,
                        trainable    = False))

    model.add(LSTM(lstm_units, return_sequences = True, recurrent_dropout = 0.2))
    model.add(Dropout(dropout))
    model.add(LSTM(lstm_units, recurrent_dropout = 0.2))

    model.add(Dropout(dropout))


    model.add(Dense(7, activation='softmax'))

    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['acc'])

    return model

Make it a fully LSTM-based model by getting rid of the FC layers, and start with smaller LSTM unit counts such as 8, 16, 32, ... (a rough sketch of such a sweep follows below).
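A sketch of sweeping those smaller sizes with the slimmed-down model above; the batch size, epoch count and dropout value are illustrative, not prescriptive.

    # Try progressively larger LSTM sizes, starting small.
    for units in [8, 16, 32]:
        model = make_model(lstm_units=units, dropout=0.5, hidden_units=0)
        history = model.fit(x_train, y_train,
                            batch_size=64, epochs=20,
                            validation_split=0.2)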

For further improvement, you can do the following.

0) Get rid of the pre-trained word2vec embedding and use your own learnable embedding.

1) Do a hyper-parameter search over the network to find the optimal model.

There are many libraries, but I find this one very flexible: https://github.com/keras-team/keras-tuner

Just install it with pip.
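The package is published on PyPI as keras-tuner, so for example:

    pip install keras-tuner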

Here is some demo code.

from tensorflow import keras
from tensorflow.keras import layers
from kerastuner.tuners import RandomSearch


def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Embedding(input_dim=hp.Int('input_dim',
                                        min_value=5000,
                                        max_value=10000,
                                        step = 1000),
                              output_dim=hp.Int('output_dim',
                                        min_value=200,
                                        max_value=800,
                                        step = 100),
                              input_length = 400))
    model.add(layers.Convolution1D(
                filters=hp.Int('filters',
                                        min_value=32,
                                        max_value=512,
                                        step = 32),
                kernel_size=hp.Int('kernel_size',
                                        min_value=3,
                                        max_value=11,
                                        step = 2),
                padding='same',
                activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling1D())
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(units=hp.Int('units',
                                        min_value=64,
                                        max_value=256,
                                        step=32),
                           activation='relu'))
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(7, activation='softmax'))
    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])),
        loss='categorical_crossentropy',
        metrics=['accuracy'])
    return model


tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    directory='my_dir',
    project_name='helloworld')
tuner.search_space_summary()

## The following lines are based on your model


tuner.search(x, y,
             epochs=5,
             validation_data=(val_x, val_y))

models = tuner.get_best_models(num_models=2)

If you want to extract more meaningful features, one approach I found promising is extracting pre-trained BERT features and then training with a CNN/LSTM.

A great repository to get started with is this one: https://github.com/UKPLab/sentence-transformers

Once you get the sentence embeddings from BERT/XLNet, you can use those features to train another classifier similar to the one you are using, except you can get rid of the embedding layer, as it's expensive.
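A minimal sketch of that idea, using a small feed-forward classifier on top of the fixed sentence embeddings (a CNN/LSTM over token-level features could be substituted). The model name 'bert-base-nli-mean-tokens' is one of the pre-trained sentence-transformers options; train_sentences, test_sentences and the layer sizes are assumptions.

    from sentence_transformers import SentenceTransformer
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    # Encode each sentence into a fixed-size vector with a pre-trained BERT model.
    encoder = SentenceTransformer('bert-base-nli-mean-tokens')
    x_train_emb = encoder.encode(train_sentences)   # shape: (n_samples, embedding_dim)
    x_test_emb  = encoder.encode(test_sentences)

    # Small classifier on top; no Embedding layer is needed since the features
    # are already dense sentence vectors.
    clf = Sequential([
        Dense(128, activation='relu', input_shape=(x_train_emb.shape[1],)),
        Dropout(0.4),
        Dense(7, activation='softmax'),
    ])
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    clf.fit(x_train_emb, y_train, epochs=20, batch_size=64, validation_split=0.2)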
