Validation accuracy metrics reported by Keras model.fit log and sklearn.metrics.confusion_matrix don't match each other
The problem is that the validation accuracy reported in the Keras model.fit history is significantly higher than the validation accuracy I get from the sklearn.metrics functions.
The results I get from model.fit are summarized below:
Last Validation Accuracy: 0.81
Best Validation Accuracy: 0.84
The results (normalized) from sklearn are pretty different:
True Negatives: 0.78
True Positives: 0.77
Validation Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.775
(see confusion matrix below for reference)
Edit: this calculation is incorrect, because one cannot use the normalized values to calculate the accuracy: they do not account for differences in the absolute number of points in each class of the dataset. Thanks to desertnaut for the comment.
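To make the point of the edit concrete, here is a minimal sketch of how the accuracy relates to a row-normalized confusion matrix: each diagonal entry has to be weighted by the number of true samples in its class. The matrix, the class counts, and the helper name `accuracy_from_normalized` are hypothetical and chosen only for illustration; they are not the values from my data.

```python
import numpy as np

def accuracy_from_normalized(cm_norm, support):
    """Accuracy from a row-normalized confusion matrix.

    cm_norm: matrix whose rows sum to 1 (per-class rates)
    support: number of true samples in each class (one entry per row)
    """
    cm_norm = np.asarray(cm_norm, dtype=float)
    support = np.asarray(support, dtype=float)
    # Each diagonal rate times its class size gives the correct count per class
    correct = np.diag(cm_norm) * support
    return float(correct.sum() / support.sum())

# Hypothetical normalized matrix and class sizes (assumptions for illustration)
cm_norm = [[0.9, 0.1],
           [0.5, 0.5]]
support = [90, 10]

naive = float(np.mean(np.diag(cm_norm)))           # unweighted mean of the diagonal
weighted = accuracy_from_normalized(cm_norm, support)
print(naive, weighted)
```

The unweighted mean of the diagonal only coincides with the accuracy when all classes have the same number of samples. Whatever the class sizes, the weighted result always lies between the smallest and the largest diagonal entry.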
Here is the graph of the validation accuracy data from the model.fit history:
And here is the confusion matrix generated from sklearn:
I think this question is somewhat similar to "Sklearn metrics values are very different from Keras values", but I've checked that both methods are doing the validation on the same pool of data, so that answer is probably not adequate for my case.
Also, the question "Keras binary accuracy metric gives too high accuracy" seems to address some problems with the way that binary cross entropy affects a multiclass problem, but in my case it may not apply, since it is a true binary classification problem.
Here are the commands used:
Model definition:
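# Imports assumed to come from tf.keras (they are not shown in the original post)
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

# Tx (sequence length) and n_x (vocabulary size) are defined elsewhere
inputs = Input((Tx, ))
n_e = 30  # embedding dimension
embeddings = Embedding(n_x, n_e, input_length=Tx)(inputs)
out = Bidirectional(LSTM(32, recurrent_dropout=0.5, return_sequences=True))(embeddings)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5, return_sequences=True))(out)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5))(out)
out = Dense(3, activation='softmax')(out)
# Named mymodel to match the compile, fit and predict calls
mymodel = Model(inputs=inputs, outputs=out)
mymodel.summary()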
Model summary:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 100)               0
_________________________________________________________________
embedding (Embedding)        (None, 100, 30)           86610
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 64)           16128
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 32)           10368
_________________________________________________________________
bidirectional_2 (Bidirection (None, 32)                6272
_________________________________________________________________
dense (Dense)                (None, 3)                 99
=================================================================
Total params: 119,477
Trainable params: 119,477
Non-trainable params: 0
_________________________________________________________________
Model compilation:
mymodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
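One thing that may be worth checking here, offered as a sketch rather than a diagnosis: as far as I know, with loss='binary_crossentropy' the 'acc' shortcut in Keras resolves to binary accuracy, which scores every output unit separately. With a Dense(3) softmax output, that is not the same quantity as per-sample accuracy. The code below uses plain NumPy instead of Keras; the predictions are hypothetical, and `binary_accuracy` and `categorical_accuracy` are illustrative re-implementations, not the library functions.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Element-wise accuracy: each output unit counts as its own prediction."""
    y_hat = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_hat == np.asarray(y_true)))

def categorical_accuracy(y_true, y_prob):
    """Per-sample accuracy: compare the argmax of prediction and target."""
    return float(np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1)))

# Hypothetical one-hot targets: 4 samples, 3 output units, column 0 never used
y_true = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 0, 1]])
# Hypothetical softmax outputs: the first two samples are right, the last two wrong
y_prob = np.array([[0.0, 0.9, 0.1],
                   [0.0, 0.2, 0.8],
                   [0.0, 0.3, 0.7],
                   [0.0, 0.6, 0.4]])

print(binary_accuracy(y_true, y_prob))       # 8 of 12 elements match
print(categorical_accuracy(y_true, y_prob))  # 2 of 4 samples match
```

In this toy case the element-wise figure is higher because a wrongly classified sample still gets credit for the output unit that is zero in both target and prediction.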
Model fit call:
num_epochs = 30
myhistory = mymodel.fit(X_pad, y, epochs=num_epochs, batch_size=50, validation_data=[X_val_pad, y_val_oh], shuffle=True, callbacks=callbacks_list)
Model fit log:
Train on 505 samples, validate on 127 samples
Epoch 1/30
500/505 [============================>.] - ETA: 0s - loss: 0.6135 - acc: 0.6667
[...]
Epoch 10/30
500/505 [============================>.] - ETA: 0s - loss: 0.1403 - acc: 0.9633
Epoch 00010: val_acc improved from 0.77953 to 0.79528, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 41ms/sample - loss: 0.1393 - acc: 0.9637 - val_loss: 0.5203 - val_acc: 0.7953
Epoch 11/30
500/505 [============================>.] - ETA: 0s - loss: 0.0865 - acc: 0.9840
Epoch 00011: val_acc did not improve from 0.79528
505/505 [==============================] - 21s 41ms/sample - loss: 0.0860 - acc: 0.9842 - val_loss: 0.5257 - val_acc: 0.7953
Epoch 12/30
500/505 [============================>.] - ETA: 0s - loss: 0.0618 - acc: 0.9900
Epoch 00012: val_acc improved from 0.79528 to 0.81102, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0615 - acc: 0.9901 - val_loss: 0.5472 - val_acc: 0.8110
Epoch 13/30
500/505 [============================>.] - ETA: 0s - loss: 0.0415 - acc: 0.9940
Epoch 00013: val_acc improved from 0.81102 to 0.82152, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0413 - acc: 0.9941 - val_loss: 0.5853 - val_acc: 0.8215
Epoch 14/30
500/505 [============================>.] - ETA: 0s - loss: 0.0443 - acc: 0.9933
Epoch 00014: val_acc did not improve from 0.82152
505/505 [==============================] - 21s 42ms/sample - loss: 0.0453 - acc: 0.9921 - val_loss: 0.6043 - val_acc: 0.8136
Epoch 15/30
500/505 [============================>.] - ETA: 0s - loss: 0.0360 - acc: 0.9933
Epoch 00015: val_acc improved from 0.82152 to 0.84777, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0359 - acc: 0.9934 - val_loss: 0.5663 - val_acc: 0.8478
[...]
Epoch 30/30
500/505 [============================>.] - ETA: 0s - loss: 0.0039 - acc: 1.0000
Epoch 00030: val_acc did not improve from 0.84777
505/505 [==============================] - 20s 41ms/sample - loss: 0.0039 - acc: 1.0000 - val_loss: 0.8340 - val_acc: 0.8110
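A quick consistency check on the numbers in this log, as a sketch only: with 127 validation samples, a per-sample accuracy has to be k/127 for some integer k. The code below tests the five-decimal val_acc values from the checkpoint messages against 127 and against 127 × 3 (the number of validation samples times the three output units). The helper name `is_count_over` and the tolerance are my own choices.

```python
def is_count_over(value, n, tol=1e-5):
    """True if `value` equals k/n for some integer k, up to log rounding."""
    k = round(value * n)
    return abs(k / n - value) < tol

n_val = 127      # "validate on 127 samples" in the log
n_outputs = 3    # Dense(3) output layer

# val_acc values printed with five decimals in the checkpoint messages
logged = [0.77953, 0.79528, 0.81102, 0.82152, 0.84777]

for v in logged:
    per_sample = is_count_over(v, n_val)
    per_element = is_count_over(v, n_val * n_outputs)
    print(v, per_sample, per_element)
```

Any value that fits k/381 but not k/127 cannot be a plain fraction of correctly classified validation samples, which would be consistent with a metric averaged over individual output units. It does not by itself prove which metric Keras used.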
Confusion matrix from sklearn:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_values, predicted_values)
The prediction values and gold values are determined as follows:
preds = mymodel.predict(X_val)
preds_ints = [[el] for el in np.argmax(preds, axis=1)]
values_pred = tokenizer_y.sequences_to_texts(preds_ints)
values_gold = tokenizer_y.sequences_to_texts(y_val)
Finally, I'd like to add that I have printed out the data and all prediction errors, and I believe the sklearn values are more reliable, since they seem to match the results I get from printing out the predictions for the saved "best" model.
On the other hand, I can't understand how the metrics can be so different. Since they are both very well-known software packages, I conclude I'm the one making the mistake here, but I can't pin down where or how.
Your question is ill-posed; as already commented, you have not computed the actual accuracy of your scikit-learn model, hence you seem to compare apples with oranges. The computation (TP + TN)/2 from a normalized confusion matrix does not give the accuracy. Here is a simple demonstration using toy data, adapting the plot_confusion_matrix from the docs:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# toy data
y_true = [0, 1, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]
class_names = [0, 1]

# plot_confusion_matrix function
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
Computing the normalized confusion matrix gives:
plot_confusion_matrix(y_true, y_pred, classes=class_names, normalize=True)
# result:
Normalized confusion matrix
[[ 0.2 0.8 ]
[ 0.33333333 0.66666667]]
and according to your incorrect rationale, the accuracy should be:
(0.67 + 0.2)/2
# 0.435
(Notice how in the normalized matrix the rows add up to 100%, something that does not happen in the full confusion matrix.)
But let's now see what the real accuracy is from the un-normalized confusion matrix:
plot_confusion_matrix(y_true, y_pred, classes=class_names) # normalize=False by default
# result
Confusion matrix, without normalization
[[1 4]
[1 2]]
from which, by the definition of accuracy as (TP + TN) / (TP + TN + FP + FN), we get:
(1+2)/(1+2+4+1)
# 0.375
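The same count-based definition can be written so that it works for any number of classes: the correct predictions sit on the diagonal of the un-normalized matrix, so the accuracy is its trace divided by the sum of all entries. A minimal sketch using the toy matrix above; the helper name `accuracy_from_counts` is only illustrative.

```python
import numpy as np

def accuracy_from_counts(cm):
    """Accuracy from an un-normalized confusion matrix of raw counts."""
    cm = np.asarray(cm)
    # Diagonal entries are the correctly classified samples of each class
    return float(np.trace(cm) / cm.sum())

# Un-normalized confusion matrix of the toy data
cm_counts = np.array([[1, 4],
                      [1, 2]])

print(accuracy_from_counts(cm_counts))
```

This only holds for raw counts; applied to the row-normalized matrix it would give the unweighted mean of the diagonal again.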
Of course, we don't need the confusion matrix to get something so elementary as the accuracy; as already advised in the comments, we can simply use the built-in accuracy_score function of scikit-learn:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
# 0.375
which, rather unsurprisingly, agrees with our direct computation from the confusion matrix.
Bottom line: where specific methods (like accuracy_score) exist, it is definitely preferable to use them instead of ad hoc inspirations, especially when something does not look right (like a discrepancy between the accuracies reported by Keras and scikit-learn).