Training loss higher than validation loss

Question

I am trying to train a regression model of a dummy function with 3 variables with fully connected neural nets in Keras and I always get a training loss much higher than the validation loss.

I split the data set in 2/3 for training and 1/3 for validation. I have tried lots of different things:

changing the architecture
adding more neurons
using regularization
using different batch sizes

Still the training error is one order of magnitude higer than the validation error:

Epoch 5995/6000
4020/4020 [==============================] - 0s 78us/step - loss: 1.2446e-04 - mean_squared_error: 1.2446e-04 - val_loss: 1.3953e-05 - val_mean_squared_error: 1.3953e-05
Epoch 5996/6000
4020/4020 [==============================] - 0s 98us/step - loss: 1.2549e-04 - mean_squared_error: 1.2549e-04 - val_loss: 1.5730e-05 - val_mean_squared_error: 1.5730e-05
Epoch 5997/6000
4020/4020 [==============================] - 0s 105us/step - loss: 1.2500e-04 - mean_squared_error: 1.2500e-04 - val_loss: 1.4372e-05 - val_mean_squared_error: 1.4372e-05
Epoch 5998/6000
4020/4020 [==============================] - 0s 96us/step - loss: 1.2500e-04 - mean_squared_error: 1.2500e-04 - val_loss: 1.4151e-05 - val_mean_squared_error: 1.4151e-05
Epoch 5999/6000
4020/4020 [==============================] - 0s 80us/step - loss: 1.2487e-04 - mean_squared_error: 1.2487e-04 - val_loss: 1.4342e-05 - val_mean_squared_error: 1.4342e-05
Epoch 6000/6000
4020/4020 [==============================] - 0s 79us/step - loss: 1.2494e-04 - mean_squared_error: 1.2494e-04 - val_loss: 1.4769e-05 - val_mean_squared_error: 1.4769e-05

This makes no sense, please help!

Edit: this is the full code

I have 6000 training examples

# -*- coding: utf-8 -*-
"""
Created on Mon Feb 26 13:40:03 2018

@author: Michele
"""
#from keras.datasets import reuters
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras import optimizers
import matplotlib.pyplot as plt
import os 
import pylab 
from keras.constraints import maxnorm
from sklearn.model_selection import train_test_split
from keras import regularizers
from sklearn.preprocessing import MinMaxScaler
import math
from sklearn.metrics import mean_squared_error
import keras

# fix random seed for reproducibility
seed=7
np.random.seed(seed)

dataset = np.loadtxt("BabbaX.csv", delimiter=",")
 #split into input (X) and output (Y) variables
#x = dataset.transpose()[:,10:15] #only use power
x = dataset
del(dataset) # delete container
dataset = np.loadtxt("BabbaY.csv", delimiter=",")
 #split into input (X) and output (Y) variables
y = dataset.transpose()
del(dataset) # delete container

 #scale labels from 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))
y = np.reshape(y, (y.shape[0],1))
y = scaler.fit_transform(y)

lenData=x.shape[0]
x=np.transpose(x)

xtrain=x[:,0:round(lenData*0.67)]
ytrain=y[0:round(lenData*0.67),]
xtest=x[:,round(lenData*0.67):round(lenData*1.0)]
ytest=y[round(lenData*0.67):round(lenData*1.0)]

xtrain=np.transpose(xtrain)
xtest=np.transpose(xtest)    

l2_lambda = 0.1 #reg factor

#sequential type of model
model = Sequential() 
#stacking layers with .add
units=300
#model.add(Dense(units, input_dim=xtest.shape[1], activation='relu', kernel_initializer='normal', kernel_regularizer=regularizers.l2(l2_lambda), kernel_constraint=maxnorm(3)))
model.add(Dense(units, activation='relu', input_dim=xtest.shape[1]))
#model.add(Dropout(0.1))
model.add(Dense(units, activation='relu'))
#model.add(Dropout(0.1))
model.add(Dense(1)) #no activation function should be used for the output layer

rms = optimizers.RMSprop(lr=0.00001, rho=0.9, epsilon=None, decay=0) #It is recommended to leave the parameters
adam = keras.optimizers.Adam(lr=0.00001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=1e-6, amsgrad=False)

#of this optimizer at their default values (except the learning rate, which can be freely tuned).
#adam=keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)

#configure learning process with .compile
model.compile(optimizer=adam, loss='mean_squared_error', metrics=['mse'])

# fit the model (iterate on the training data in batches)
history = model.fit(xtrain, ytrain, nb_epoch=1000, batch_size=round(xtest.shape[0]/100),
              validation_data=(xtest, ytest), shuffle=True, verbose=2)

#extract weights for each layer
weights = [layer.get_weights() for layer in model.layers]

#evaluate on training data set
valuesTrain=model.predict(xtrain)

#evaluate on test data set
valuesTest=model.predict(xtest)

 #invert predictions
valuesTrain = scaler.inverse_transform(valuesTrain)
ytrain = scaler.inverse_transform(ytrain)
valuesTest = scaler.inverse_transform(valuesTest)
ytest = scaler.inverse_transform(ytest)

Answer 1

TL;DR : When a model is learning well and quickly the validation loss can be lower than the training loss, since the validation happens on the updated model, while the training loss did not have any (no batches) or only some (with batches) of the updates applied.

Okay I think I found out what's happening here. I used the following code to test this.

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt

np.random.seed(7)

N_DATA = 6000

x = np.random.uniform(-10, 10, (3, N_DATA))
y = x[0] + x[1]**2 + x[2]**3

xtrain = x[:, 0:round(N_DATA*0.67)]
ytrain = y[0:round(N_DATA*0.67)]

xtest = x[:, round(N_DATA*0.67):N_DATA]
ytest = y[round(N_DATA*0.67):N_DATA]

xtrain = np.transpose(xtrain)
xtest = np.transpose(xtest)

model = Sequential()
model.add(Dense(10, activation='relu', input_dim=3))
model.add(Dense(5, activation='relu'))
model.add(Dense(1))

adam = keras.optimizers.Adam()

# configure learning process with .compile
model.compile(optimizer=adam, loss='mean_squared_error', metrics=['mse'])

# fit the model (iterate on the training data in batches)
history = model.fit(xtrain, ytrain, nb_epoch=50,
                    batch_size=round(N_DATA/100),
                    validation_data=(xtest, ytest), shuffle=False, verbose=2)

plt.plot(history.history['mean_squared_error'])
plt.plot(history.history['val_loss'])
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

This is essentially the same as your code and replicates the problem, which is not actually a problem. Simply change

history = model.fit(xtrain, ytrain, nb_epoch=50,
                    batch_size=round(N_DATA/100),
                    validation_data=(xtest, ytest), shuffle=False, verbose=2)

to

history = model.fit(xtrain, ytrain, nb_epoch=50,
                    batch_size=round(N_DATA/100),
                    validation_data=(xtrain, ytrain), shuffle=False, verbose=2)

So instead of validating with your validation data you validate using the training data again, which leads to exactly the same behavior. Weird isn't it? No actually not. What I think is happening is:

The initial mean_squared_error given by Keras on every epoch is the loss before the gradients have been applied, while the validation happens after the gradients have been applied, which makes sense.

With highly stochastic problems for which NNs are usually used you do not see that, because the data varies so much that the updated weights simply are not good enough to describe the validation data, the slight overfitting effect on the training data is still so much stronger that even after updating the weights the validation loss is still higher than the training loss from before. That is only how I think it is though, I might be completely wrong.

Answer 2

One of the reasons that I think is maybe you can increase the size of training data and lower the size of validation data. Then your model will be trained on more samples which may include some complex samples as well and then can be validated on the remaining samples. Try something like train-80% and Validation-20% or any other numbers a little higher than what you used previously.

If you don't want to change the size of training and validation sets, then you can try changing the random seed value to some other number so that you will get a training set with different samples which might be helpful in training the model well.

Check this answer here to get more understanding of the other possible reasons.

Check this link if you want a more detailed explanation with an example. @Michele

Answer 3

If training loss is a little higher or nearer to validation loss, it mean that model is not overfitting. Efforts are always there to use best out of features to have less overfitting and better validation and test accuracies. Probable reason that you are always getting train loss higher can be the features and data you are using to train.

Please refer following link and observe the training and validation loss in case of dropout: http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/

Training loss higher than validation loss

Question

3 answers

solution1
2 2018-05-17 12:57:57

solution2
0 2021-10-19 02:37:11

solution3
-2 2018-05-17 10:56:10

Training loss higher than validation loss

Question

3 answers

solution1 2 2018-05-17 12:57:57

solution2 0 2021-10-19 02:37:11

solution3 -2 2018-05-17 10:56:10

solution1
2 2018-05-17 12:57:57

solution2
0 2021-10-19 02:37:11

solution3
-2 2018-05-17 10:56:10