
How can I interpret Scikit-learn confusion matrix

I am using a confusion matrix to check the performance of my classifier.

I am using Scikit-Learn, and I am a little bit confused. How can I interpret the result of:

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

How can I decide whether these predicted values are good or not?

The simplest way to decide whether the classifier is good or bad is to calculate an error using one of the standard error measures (for example, the mean squared error). I imagine your example is copied from Scikit's documentation, so I assume you've read the definition.
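For your example, such a single-number measure could be computed like this (a minimal sketch using metrics scikit-learn already ships; note that for categorical labels, accuracy is usually a more natural choice than MSE, which is shown only for illustration):

```python
from sklearn.metrics import accuracy_score, mean_squared_error

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Fraction of samples predicted correctly: 4 out of 6 here
acc = accuracy_score(y_true, y_pred)

# MSE treats the labels as numbers, which is only really meaningful
# for ordinal labels; shown purely as an illustration
mse = mean_squared_error(y_true, y_pred)

print(acc)  # 0.666...
print(mse)  # 0.833...
```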

We have three classes here: 0, 1 and 2. On the diagonal, the confusion matrix tells you how often a particular class has been predicted correctly. So from the diagonal 2 0 2 we can say that the class with index 0 was classified correctly 2 times, the class with index 1 was never predicted correctly, and the class with index 2 was predicted correctly 2 times.
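You can read those per-class correct counts straight off the matrix (a small sketch rebuilding the matrix from your example):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)

# The diagonal holds the number of correct predictions per class
correct_per_class = np.diag(cm)
print(correct_per_class)  # [2 0 2]
```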

Below and above the diagonal you have numbers which tell you how many times a class with index equal to the element's row number was classified as the class with index equal to the matrix's column number. For example, if you look at the first column, below the diagonal you have: 0 1 (in the lower left corner of the matrix). The lower 1 tells you that the class with index 2 (the last row) was once erroneously classified as 0 (the first column). This corresponds to the fact that in your y_true there was one sample with label 2 that was classified as 0. This happened for the first sample.
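In index terms: rows are true labels and columns are predicted labels, so a single misclassification count can be looked up directly (same example data as above):

```python
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)

# cm[2, 0]: samples whose true label is 2 but which were predicted as 0
# (the first sample: y_true[0] == 2, y_pred[0] == 0)
print(cm[2, 0])  # 1

# cm[1, 2]: the single class-1 sample that was predicted as class 2
print(cm[1, 2])  # 1
```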

If you sum all the numbers in the confusion matrix you get the number of test samples (2 + 2 + 1 + 1 = 6, equal to the length of y_true and y_pred). If you sum the rows you get the number of samples for each true label: as you can verify, there are indeed two 0s, one 1 and three 2s in y_true (summing the columns would give you the label counts in y_pred instead).
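These bookkeeping checks are one-liners with NumPy (sketch, reusing the same example):

```python
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)

total = cm.sum()                   # total number of samples
per_true_label = cm.sum(axis=1)    # how many samples of each label in y_true
per_pred_label = cm.sum(axis=0)    # how many samples of each label in y_pred

print(total)           # 6
print(per_true_label)  # [2 1 3]
print(per_pred_label)  # [3 0 3]
```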

If you, for example, divide the matrix elements by these row sums, you could tell that, for example, the class with label 2 is recognized correctly with ~66% accuracy, and in 1/3 of the cases it is confused (hence the name) with the class with label 0.
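That row-wise normalization can be sketched like this (dividing each row by its sum, which puts per-class recall on the diagonal):

```python
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)

# Divide each row by its sum so every row adds up to 1
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

# cm_normalized[2, 2] ~= 0.67: class 2 is recognized correctly ~66% of the time
# cm_normalized[2, 0] ~= 0.33: in 1/3 of the cases it is confused with class 0
print(cm_normalized[2])  # [0.333... 0. 0.666...]
```

Recent scikit-learn versions can also do this for you via `confusion_matrix(y_true, y_pred, normalize='true')`.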

TL;DR:

While single-number error measures capture overall performance, with a confusion matrix you can determine whether (some examples):

  • your classifier just sucks at everything

  • or it handles some classes well, and some badly (this gives you a hint to look at this particular part of your data and observe the classifier's behaviour for these cases)

  • it does well, but confuses label A with B quite often. For example, for linear classifiers you may then want to check whether these classes are linearly separable.

Etc.
