
Why is my neural network validation accuracy higher than my training accuracy and they both become constant?

I have built a model, and when I train it, the validation loss is smaller than the training loss and the validation accuracy is higher than the training accuracy. Is the model overfitting? Am I doing something wrong? Can someone please look at my model and see if there is anything wrong with it? Thank you.

from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     GlobalMaxPooling1D, Dropout, Dense, concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Text input (padded token ids, length 200) and a 2-feature metadata input
input_text = Input(shape=(200,), dtype='int32', name='input_text')
meta_input = Input(shape=(2,), name='meta_input')
embedding = Embedding(input_dim=len(tokenizer.word_index) + 1,
                      output_dim=300,
                      input_length=200)(input_text)

lstm = Bidirectional(LSTM(units=128,
                          dropout=0.5,
                          recurrent_dropout=0.5,
                          return_sequences=True),
                     merge_mode='concat')(embedding)
pool = GlobalMaxPooling1D()(lstm)
dropout = Dropout(0.5)(pool)
text_output = Dense(n_codes, activation='sigmoid', name='aux_output')(dropout)

output = concatenate([text_output, meta_input])
output = Dense(n_codes, activation='relu')(output)

main_output = Dense(n_codes, activation='softmax', name='main_output')(output)

model = Model(inputs=[input_text, meta_input], outputs=[output])
optimizer = Adam(lr=.001)
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()
model.fit([X1_train, X2_train], [y_train],
          validation_data=([X1_valid, X2_valid], [y_valid]),
          batch_size=64, epochs=20, verbose=1)

Here is the output:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_text (InputLayer)         [(None, 200)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 200, 300)     889500      input_text[0][0]                 
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 200, 256)     439296      embedding[0][0]                  
__________________________________________________________________________________________________
global_max_pooling1d (GlobalMax (None, 256)          0           bidirectional[0][0]              
__________________________________________________________________________________________________
dropout (Dropout)               (None, 256)          0           global_max_pooling1d[0][0]       
__________________________________________________________________________________________________
aux_output (Dense)              (None, 545)          140065      dropout[0][0]                    
__________________________________________________________________________________________________
meta_input (InputLayer)         [(None, 2)]          0                                            
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 547)          0           aux_output[0][0]                 
                                                                 meta_input[0][0]                 
__________________________________________________________________________________________________
dense (Dense)                   (None, 545)          298660      concatenate[0][0]                
==================================================================================================
Total params: 1,767,521
Trainable params: 1,767,521
Non-trainable params: 0
__________________________________________________________________________________________________
Train on 11416 samples, validate on 2035 samples
Epoch 1/20
11416/11416 [==============================] - 158s 14ms/sample - loss: 0.0955 - accuracy: 0.9929 - 
val_loss: 0.0559 - val_accuracy: 0.9964
Epoch 2/20
11416/11416 [==============================] - 152s 13ms/sample - loss: 0.0562 - accuracy: 0.9963 - 
val_loss: 0.0559 - val_accuracy: 0.9964
Epoch 3/20
11416/11416 [==============================] - 209s 18ms/sample - loss: 0.0562 - accuracy: 0.9963 - 
val_loss: 0.0559 - val_accuracy: 0.9964
Epoch 4/20
11416/11416 [==============================] - 178s 16ms/sample - loss: 0.0562 - accuracy: 0.9963 - 
val_loss: 0.0559 - val_accuracy: 0.9964
Epoch 5/20
11416/11416 [==============================] - 211s 18ms/sample - loss: 0.0562 - accuracy: 0.9963 - 
val_loss: 0.0559 - val_accuracy: 0.9964
Epoch 6/20

The difference is marginal, so I would not worry. In general, what might be happening is that, by chance, the random split between the training and validation sets put examples in the validation set that are "easier" to predict than the ones in the training set.

You could overcome this by developing a cross-validation strategy such as the following (a sketch of the splitting loop is shown after the list):

  • Take 10% of the dataset out (holdout) and treat it as your test set.
  • With the remaining data, make an 80%-20% split into training and validation sets.
  • Repeat the 80-20 training/validation split 5 times.
  • Train 5 models on your 5 different train/valid datasets and see what the results are.
  • You can even compare all 5 models on the test set, just to see what the "real" or "closer to reality" accuracy would be. That might help you see which model generalizes better.
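
A minimal sketch of that splitting loop, assuming the unsplit features live in arrays X1 (text) and X2 (metadata) with labels y, and that build_model() rebuilds the architecture above from scratch (all of these names are placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

# 1) Hold out 10% of the data as a test set that no model sees during training.
idx = np.arange(len(y))
idx_dev, idx_test = train_test_split(idx, test_size=0.10, random_state=0)

test_acc = []
for seed in range(5):
    # 2) Fresh 80/20 train/validation split of the remaining data each round.
    idx_tr, idx_va = train_test_split(idx_dev, test_size=0.20, random_state=seed)

    model = build_model()  # placeholder: rebuilds the same architecture
    model.fit([X1[idx_tr], X2[idx_tr]], y[idx_tr],
              validation_data=([X1[idx_va], X2[idx_va]], y[idx_va]),
              batch_size=64, epochs=20, verbose=0)

    # 3) Score every model on the same held-out test set.
    loss, acc = model.evaluate([X1[idx_test], X2[idx_test]], y[idx_test], verbose=0)
    test_acc.append(acc)

print("test accuracy per split:", test_acc)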

In the end you might even consider stacking them together: https://machinelearningmastery.com/stacking-ensemble-for-deep-learning-neural-networks/

The fact that both training and validation accuracy look similar and do not change during training indicates that the model might be stuck in a local minimum. It is worth training for more epochs (at least 20) to see if the model can "jump" out of that local minimum with the current learning rate.

If this does not solve the problem, I would change the learning rate from .001 to .0001 or .00001. This should help the model converge, hopefully to a global minimum.
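
For example (just a sketch; the exact values are starting points, and the ReduceLROnPlateau callback is an extra convenience I am adding here, not something the advice above requires):

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Recompile with a smaller fixed learning rate ...
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# ... and/or shrink it further whenever val_loss stops improving.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                              patience=3, min_lr=1e-6)

model.fit([X1_train, X2_train], y_train,
          validation_data=([X1_valid, X2_valid], y_valid),
          batch_size=64, epochs=40, verbose=1,
          callbacks=[reduce_lr])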

If this does not solve the problem, there are many other parameters/hyperparameters that might be worth checking: the number of nodes in the layers, the number of layers, the optimizer strategy, and the size and distribution (generality and variance) of the training set...

Overfitting would be when acc is higher than val_acc and loss is lower than val_loss.

However, it looks to me like your validation dataset is not representative of the overall distribution in the data. For whatever reason, the results on your validation dataset are constant, and even constantly higher.

You are doing a binary classification. Be aware of class imbalance!

E.g. if 99% of your samples are class 0 and 1% are class 1, then even if your model doesn't learn anything it will reach 99% accuracy by always predicting 0 and never once predicting a 1. Now imagine your (mostly random) split of the data produced a validation set with 99.5% class 0 and 0.5% class 1, and imagine, in the worst case, that your model learns nothing and always spits out ("predicts") 0. Then the training accuracy will be constantly 0.99 with some fixed loss, and the validation accuracy will be constantly 0.995.
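
To make that arithmetic concrete, here is a tiny sketch with made-up numbers (the 99.5% / 0.5% split is the hypothetical worst case described above, not your actual data):

import numpy as np

# Hypothetical validation set: 99.5% class 0, 0.5% class 1.
y_valid = np.zeros(2000, dtype=int)
y_valid[:10] = 1                      # 10 positives out of 2000 = 0.5%

# A model that has learned nothing and always "predicts" 0 ...
y_pred = np.zeros_like(y_valid)

# ... still reports 99.5% accuracy.
print((y_pred == y_valid).mean())     # 0.995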

What puzzles me is that your performance measures are constant. That is ALWAYS bad, because if the model learns something there is usually some stochastic noise in the metrics, even when it overfits.

No book tells you the following, no beginner book anyway, and I learned it by experience: you have to put shuffle=True in your model.fit(). It seems to me that you are training in a way that presents the model with only the samples of one class first, and then the samples of the other class. Mixing samples of the two classes perturbs the model enough to keep it from getting stuck in some local minimum.

Sometimes, though, I have gotten such constant results even when shuffling.

In that case, I just try another random split, which then works better. (So: try other splits! A sketch of both ideas follows below.)
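
A minimal sketch of both ideas, assuming the unsplit arrays are called X1, X2 and y (placeholder names):

from sklearn.model_selection import train_test_split

# Try a different random split of the full data (change random_state to re-roll).
X1_train, X1_valid, X2_train, X2_valid, y_train, y_valid = train_test_split(
    X1, X2, y, test_size=0.15, random_state=42, shuffle=True)

# Shuffle the training samples between epochs as well.
model.fit([X1_train, X2_train], y_train,
          validation_data=([X1_valid, X2_valid], y_valid),
          batch_size=64, epochs=20, verbose=1,
          shuffle=True)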

No, there is nothing wrong. This effect (validation metrics being better than training metrics) is common when Dropout is used, as it is in your network.

Dropout adds noise during training, and this noise is not present during validation/testing, so it is natural that the training metrics get a bit worse. The validation metrics do not have this noise and come out a bit better, thanks to the improved generalization produced by Dropout.
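
You can see this directly: a Dropout layer only zeroes activations when it is called in training mode and is a no-op otherwise, so the training metrics are computed on a noisier network than the validation metrics. A quick sketch:

import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 8))

print(drop(x, training=True).numpy())   # roughly half the values zeroed, the rest scaled up to 2.0
print(drop(x, training=False).numpy())  # all ones: dropout is inactive at validation/test time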
