
How to determine an overfitted model based on loss, precision and recall

I've written an LSTM network with Keras (code below):

    # imports for the snippet (omitted in the original); f1 is a custom metric defined elsewhere
    import numpy as np
    import pandas as pd
    import keras_metrics
    from keras import optimizers
    from keras.models import Sequential
    from keras.layers import LSTM, LeakyReLU, Dropout, Flatten, Dense
    from sklearn.utils import shuffle
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("../data/training_data.csv")

    # Group by and pivot the data
    group_index = df.groupby('group').cumcount()
    data = (df.set_index(['group', group_index])
            .unstack(fill_value=0).stack())

    # getting np array of the data and labeling
    # on the label group we take the first label because it is the same for all
    target = np.array(data['label'].groupby(level=0).apply(lambda x: [x.values[0]]).tolist())
    data = data.loc[:, data.columns != 'label']
    data = np.array(data.groupby(level=0).apply(lambda x: x.values.tolist()).tolist())

    # shuffle the training set
    data, target = shuffle(data, target)

    # split data into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=4)

    # ADAM Optimizer with learning rate decay
    opt = optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0001)

    # build the model
    model = Sequential()

    num_features = data.shape[2]
    num_samples = data.shape[1]

    model.add(LSTM(8, batch_input_shape=(None, num_samples, num_features), return_sequences=True, activation='sigmoid'))
    model.add(LeakyReLU(alpha=.001))
    model.add(Dropout(0.2))
    model.add(LSTM(4, return_sequences=True, activation='sigmoid'))
    model.add(LeakyReLU(alpha=.001))
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer=opt,
                  metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall(),f1])

    model.summary()


    # Training, getting the results history for plotting
    history = model.fit(x_train, y_train, epochs=3000, validation_data=(x_test, y_test))

The monitored metrics are loss, accuracy, precision, recall and f1 score.

I've noticed that the validation loss starts to climb around 300 epochs, so I figured the model is overfitting. However, recall is still climbing and precision is slightly improving.


[Plots of training/validation loss, precision and recall over the training epochs]


Why is that? Is my model overfitted?

the validation loss starts to climb around 300 epochs (...) recall is still climbing and precision is slightly improving. (...) Why is that?

Precision and recall are measures of how well your classifier performs in terms of the predicted class labels. Model loss, on the other hand, is a measure of the cross entropy, the error in the predicted classification probability:

    loss = -(y * log(p) + (1 - y) * log(1 - p))

where

y = true class label (0 or 1)
p = predicted probability of the positive class

For example, the (softmax) outputs of the model for one observation may look like this at different epochs:

# epoch 300
y = [0.1, 0.9] => argmax(y) => 1 (class label 1)
loss = -(1 * log(0.9)) = 0.10

# epoch 500
y = [0.4, 0.6] => argmax(y) => 1 (class label 1)
loss = -(1 * log(0.6)) = 0.51

In both cases the precision and recall metrics stay unchanged (the class label is still predicted correctly), but the model loss has increased. In general terms, the model has become "less sure" about its prediction, although it is still correct.

Note that in your model the loss is calculated over all observations, not just a single one; I limited the discussion to one observation for simplicity. The loss formula extends trivially to n > 1 observations by averaging the loss over all of them.
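For instance, here is a small sketch (the probabilities are made up, not taken from the plots above) showing the averaged loss rising between two epochs while precision and recall stay exactly the same:

    # made-up predicted probabilities for the same four observations at two epochs
    import numpy as np
    from sklearn.metrics import precision_score, recall_score, log_loss

    y_true = np.array([1, 1, 0, 1])                 # ground-truth class labels

    p_epoch_300 = np.array([0.9, 0.8, 0.2, 0.7])    # confident predictions
    p_epoch_500 = np.array([0.6, 0.6, 0.4, 0.55])   # same labels, but less confident

    for name, p in [("epoch 300", p_epoch_300), ("epoch 500", p_epoch_500)]:
        y_pred = (p >= 0.5).astype(int)             # hard labels, as used by precision/recall
        print(name,
              "precision:", precision_score(y_true, y_pred),
              "recall:", recall_score(y_true, y_pred),
              "average loss:", round(log_loss(y_true, p), 3))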

Is my model overfitted?

In order to determine this, you have to compare training loss and validation loss. You cannot tell by validation loss alone. If training loss decreases and validation loss increases, your model is overfitting.
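As a minimal sketch, assuming the history object returned by model.fit() in the question, you can plot the two losses against each other and look for the point where they start to diverge:

    import matplotlib.pyplot as plt

    plt.plot(history.history['loss'], label='training loss')
    plt.plot(history.history['val_loss'], label='validation loss')
    plt.xlabel('epoch')
    plt.ylabel('binary cross-entropy')
    plt.legend()
    plt.show()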

Indeed, if the validation loss starts growing again, then you may want to stop early. This is a "standard" approach, named "early stopping" ( https://en.wikipedia.org/wiki/Early_stopping ). Clearly, if the loss on your validation data is increasing, then the model is not doing as well as it could: it is overfitting.
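A hedged sketch of how this could look in the question's code, using Keras's built-in EarlyStopping callback (the patience value is only illustrative):

    from keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss',        # watch the validation loss
                               patience=50,               # allow 50 epochs without improvement
                               restore_best_weights=True) # needs a reasonably recent Keras version

    history = model.fit(x_train, y_train,
                        epochs=3000,
                        validation_data=(x_test, y_test),
                        callbacks=[early_stop])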

Precision and recall are not enough: they can increase if your model predicts more positives and fewer negatives (for instance 9 positives for every 1 negative). Then these ratios can seem to improve, but it is just that you have fewer true negatives.

These two effects put together can help shed some light on what is happening here. The correct predictions may still be correct, but with lower quality (the loss for individual samples increases on average while still keeping the predicted labels right), and there could be a biased shift in the predictions (true negatives being transformed into false positives).
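A toy illustration (the labels below are made up) of that second effect: recall climbs simply because more samples are predicted positive, while true negatives are lost:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    y_true        = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred_before = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # conservative predictions
    y_pred_after  = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # the model now predicts positive more often

    for name, y_pred in [("before", y_pred_before), ("after", y_pred_after)]:
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        print(name,
              "recall:", recall_score(y_true, y_pred),
              "precision:", round(precision_score(y_true, y_pred), 2),
              "true negatives:", tn)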

As @Matthieu mentioned, looking at precision and recall of one class alone can be misleading. Maybe we have to look at the performance on the other class as well.

A better measure could be concordance (AUC of the ROC curve) in the case of binary classification. Concordance measures how well the model rank-orders the data points according to their likelihood of belonging to a class.

One more option is macro-/micro-averaged precision/recall to get the complete picture of the model's performance.
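A minimal sketch (with hypothetical labels and scores) of both suggestions, using scikit-learn's roc_auc_score for concordance and macro-/micro-averaged precision and recall:

    from sklearn.metrics import roc_auc_score, precision_score, recall_score

    y_true  = [0, 1, 1, 0, 1, 0]
    y_score = [0.2, 0.8, 0.6, 0.4, 0.9, 0.3]       # predicted probabilities, e.g. model.predict(x_test)
    y_pred  = [int(s >= 0.5) for s in y_score]     # hard labels at a 0.5 threshold

    print("ROC AUC (concordance):", roc_auc_score(y_true, y_score))
    print("macro precision/recall:",
          precision_score(y_true, y_pred, average='macro'),
          recall_score(y_true, y_pred, average='macro'))
    print("micro precision/recall:",
          precision_score(y_true, y_pred, average='micro'),
          recall_score(y_true, y_pred, average='micro'))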
