
Text classification CNN overfits training

I am trying to use a CNN architecture to classify text sentences. The architecture of the network is as follows:

from tensorflow.keras.layers import Input, Conv1D, Dropout, MaxPooling1D, Dense, Flatten
from tensorflow.keras.models import Model

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")

conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
pool1 = MaxPooling1D(pool_size=2)(drop21)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
drop22 = Dropout(0.5)(conv22)
pool2 = MaxPooling1D(pool_size=2)(drop22)
dense = Dense(16, activation='relu')(pool2)

flat = Flatten()(dense)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)

outputs = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I have some callbacks, such as EarlyStopping and ReduceLROnPlateau, to stop the training and to reduce the learning rate when the validation loss stops improving:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_loss', 
                               patience=5)
model_checkpoint = ModelCheckpoint(filepath=checkpoint_filepath,
                                   save_weights_only=False,
                                   monitor='val_loss',
                                   mode="auto",
                                   save_best_only=True)
learning_rate_decay = ReduceLROnPlateau(monitor='val_loss', 
                                        factor=0.1, 
                                        patience=2, 
                                        verbose=1, 
                                        mode='auto',
                                        min_delta=0.0001, 
                                        cooldown=0,
                                        min_lr=0)

Once the model is trained, the training history looks as follows: [plot of training and validation loss]

We can observe here that the validation loss stops improving from epoch 5 onwards, while the training loss keeps decreasing at each step, i.e. the model is overfitting.

I would like to know if I'm doing something wrong in the architecture of the CNN. Aren't the dropout layers enough to avoid overfitting? What other ways are there to reduce overfitting?

Any suggestions?

Thanks in advance.


Edit:

I have also tried regularization, and the results were even worse:

kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)
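For reference, this is roughly how those arguments were attached to the layers, shown here only on the first convolution as an example:

from tensorflow.keras.regularizers import l2

# L2 penalties on both the kernel and the bias of the first convolution
conv2 = Conv1D(filters=128, kernel_size=5, activation='relu',
               kernel_regularizer=l2(0.01),
               bias_regularizer=l2(0.01))(text_input)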

[training history plot with L2 regularization]


Edit 2:

I have tried applying a BatchNormalization layer after each convolution, and the result is the following:

norm = BatchNormalization()(conv2)
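More concretely, a sketch of how the normalization layers were slotted into the model above (the exact position relative to the dropout is just how this sketch wires it):

from tensorflow.keras.layers import BatchNormalization

conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
norm1 = BatchNormalization()(conv2)
drop21 = Dropout(0.5)(norm1)
pool1 = MaxPooling1D(pool_size=2)(drop21)

conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
norm2 = BatchNormalization()(conv22)
drop22 = Dropout(0.5)(norm2)
pool2 = MaxPooling1D(pool_size=2)(drop22)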

[training history plot with BatchNormalization]


Edit 3:

After applying the LSTM architecture:

from tensorflow.keras.layers import Bidirectional, LSTM

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")

conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(drop21)
drop22 = Dropout(0.5)(conv22)

lstm1 = Bidirectional(LSTM(128, return_sequences = True))(drop22)
lstm2 = Bidirectional(LSTM(64, return_sequences = True))(lstm1)

flat = Flatten()(lstm2)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)

outputs = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

[training history plot for the CNN + BiLSTM model]

Overfitting can be caused by many factors; it happens when your model fits the training set too well.

To handle it, you can try several approaches:

  1. Add more data
  2. Use data augmentation
  3. Use architectures that generalize well
  4. Add regularization (mostly dropout, L1/L2 regularization are also possible)
  5. Reduce architecture complexity.

For more detail, you can read https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-overfitting-2bd5d99abe5d
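As a rough sketch of points 4 and 5 together, a smaller, regularized variant of your network could look like this (the filter counts and the l2 factor below are placeholders, not tuned values):

from tensorflow.keras.layers import Input, Conv1D, Dropout, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")

# a single, narrower convolution block instead of two
conv = Conv1D(filters=64, kernel_size=5, activation='relu',
              kernel_regularizer=l2(1e-4))(text_input)
drop = Dropout(0.5)(conv)

# global pooling replaces MaxPooling1D + Flatten and removes many parameters
pooled = GlobalMaxPooling1D()(drop)

dense = Dense(64, activation='relu', kernel_regularizer=l2(1e-4))(pooled)
outputs = Dense(y_train.shape[1], activation='softmax')(dense)

model = Model(inputs=text_input, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])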

This is screaming transfer learning. The google-universal-sentence-encoder is perfect for this use case. Replace your model with:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the multilingual encoder

text_input = Input(shape=X_train_vec.shape[1:], name = "Text_input")

# this next layer might need some tweaking dimension wise, to correctly fit
# X_train in the model
squeezed = tf.keras.layers.Lambda(lambda x: tf.squeeze(x))(text_input)
# conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
# drop21 = Dropout(0.5)(conv2)
# pool1 = MaxPooling1D(pool_size=2)(drop21)
# conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
# drop22 = Dropout(0.5)(conv22)
# pool2 = MaxPooling1D(pool_size=2)(drop22)

# 1) you might need `squeezed = tf.expand_dims(squeezed, axis=0)` here
# 2) If you're classifying English only, you can use the link to the normal `google-universal-sentence-encoder`, not the multilingual one
# 3) both the English and multilingual have a `-large` version. More accurate but slower to train and infer. 
embedded = hub.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')(squeezed) 

# this layer seems out of place, 
# dense = Dense(16, activation='relu')(embedded) 

# you don't need to flatten after a dense layer (in your case) or a backbone (in my case (google-universal-sentence-encoder))
# flat = Flatten()(dense)

dense = Dense(128, activation='relu')(embedded)
out = Dense(32, activation='relu')(dense)

outputs = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=text_input, outputs=outputs)
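Note that the universal-sentence-encoder expects raw sentence strings rather than pre-vectorized sequences, so if you still have the original sentences (assumed below to be in X_train_text, a list of strings; that name is not from your code), a simpler variant is to feed them in directly:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the multilingual encoder

# one scalar string per example; the encoder outputs a 512-dimensional embedding
text_input = Input(shape=(), dtype=tf.string, name="Text_input")
embedded = hub.KerasLayer(
    'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')(text_input)

dense = Dense(128, activation='relu')(embedded)
out = Dense(32, activation='relu')(dense)
outputs = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=text_input, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(np.array(X_train_text), y_train, epochs=10, validation_split=0.1)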

I think that since you are doing text classification, adding 1 or 2 LSTM layers might help the network learn better, since it will be able to better associate with the context of the data. I suggest adding the following code before the Flatten layer.

lstm1 = Bidirectional(LSTM(128, return_sequences=True))(drop22)
lstm2 = Bidirectional(LSTM(64))(lstm1)

LSTM layers can help the neural network learn associations between certain words and might improve the accuracy of your network.

I also suggest dropping the max-pooling layers, as max pooling, especially in text classification, can lead the network to drop some useful features. Just keep the convolutional layers and the dropout. Also remove the Dense layer before Flatten and add the aforementioned LSTMs.

It is unclear how you feed the text into your model. I am assuming that you tokenize the text to represent it as a sequence of integers, but do you use any word embedding prior to feeding it into the model? If not, I suggest you throw a trainable TensorFlow Embedding layer at the start of your model. There is a clever technique called embedding lookup to speed up its training, but you can save that for later. Try adding this layer to your model; then your Conv1D layer will have a much easier time working on a sequence of floats. Also, I suggest adding BatchNormalization after each Conv1D, which should help speed up convergence and training.
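A minimal sketch of what I mean, assuming your input is a padded sequence of token ids (called X_train_ids here; the vocabulary size and embedding dimension are made-up values you would replace):

from tensorflow.keras.layers import (Input, Embedding, Conv1D, BatchNormalization,
                                     Dropout, GlobalMaxPooling1D, Dense)
from tensorflow.keras.models import Model

vocab_size = 20000        # assumption: size of your tokenizer's vocabulary
embedding_dim = 128       # assumption: dimensionality of the learned word vectors
seq_len = X_train_ids.shape[1]

ids_input = Input(shape=(seq_len,), dtype='int32', name="Token_ids")
embedded = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(ids_input)  # trainable

conv1 = Conv1D(filters=128, kernel_size=5, activation='relu')(embedded)
norm1 = BatchNormalization()(conv1)
drop1 = Dropout(0.5)(norm1)

conv2 = Conv1D(filters=64, kernel_size=5, activation='relu')(drop1)
norm2 = BatchNormalization()(conv2)
drop2 = Dropout(0.5)(norm2)

pooled = GlobalMaxPooling1D()(drop2)
outputs = Dense(y_train.shape[1], activation='softmax')(pooled)

model = Model(inputs=ids_input, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])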
