

How is scikit-learn cross_val_predict accuracy score calculated?

Does cross_val_predict (see doc, v0.18) with the k-fold method shown in the code below calculate the accuracy for each fold and then average them, or not?

from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# td is the feature matrix and labels the target vector (defined elsewhere)
cv = KFold(n_splits=20)
clf = SVC()
ypred = cross_val_predict(clf, td, labels, cv=cv)
accuracy = accuracy_score(labels, ypred)
print(accuracy)

No, it does not!

According to the cross-validation documentation page, cross_val_predict does not return any scores, but only the labels predicted under a certain strategy, which is described here:

The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).

Therefore, by calling accuracy_score(labels, ypred) you are just calculating the accuracy of the labels predicted by the aforementioned strategy compared to the true labels. This, again, is specified on the same documentation page:

These predictions can then be used to evaluate the classifier:

predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted)

Note that the result of this computation may be slightly different from those obtained using cross_val_score as the elements are grouped in different ways.
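To see that difference concretely, here is a minimal sketch (using the iris data from the documentation example; the classifier and the number of folds are just illustrative choices) that computes both numbers side by side. The pooled accuracy is the sample-weighted average of the per-fold accuracies, so the two coincide when every fold has the same size but can drift apart otherwise.

from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
clf = SVC()
cv = KFold(n_splits=7)  # 150 samples do not split evenly into 7 folds

# mean of the per-fold accuracies (every fold weighted equally)
fold_scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
print("mean of per-fold accuracies:", fold_scores.mean())

# accuracy of the pooled out-of-fold predictions (every sample weighted equally)
pooled = cross_val_predict(clf, iris.data, iris.target, cv=cv)
print("accuracy of pooled predictions:", accuracy_score(iris.target, pooled))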

If you need accuracy scores of different folds you should try:

>>> scores = cross_val_score(clf, X, y, cv=cv)
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

and then, for the mean accuracy of all folds, use scores.mean():

>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)

How to calculate Cohen kappa coefficient and confusion matrix for each fold?

For calculating the Cohen kappa coefficient and confusion matrix, I assume you mean the kappa coefficient and confusion matrix between the true labels and each fold's predicted labels:

from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import cohen_kappa_score, confusion_matrix

cv = KFold(n_splits=20)
clf = SVC()
for train_index, test_index in cv.split(X):
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    kappa_score = cohen_kappa_score(labels[test_index], ypred)
    conf_mat = confusion_matrix(labels[test_index], ypred)  # renamed so the imported function is not shadowed
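If you want to keep or aggregate the per-fold results instead of only the last fold's values, a small variation of the loop above (still assuming your X and labels arrays) can collect them in lists, for example:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import cohen_kappa_score, confusion_matrix

cv = KFold(n_splits=20)
clf = SVC()
class_labels = np.unique(labels)  # fixes the confusion matrix shape for every fold

kappa_scores, conf_matrices = [], []
for train_index, test_index in cv.split(X):
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    kappa_scores.append(cohen_kappa_score(labels[test_index], ypred))
    conf_matrices.append(confusion_matrix(labels[test_index], ypred, labels=class_labels))

print("mean kappa over folds:", np.mean(kappa_scores))
print("summed confusion matrix:\n", sum(conf_matrices))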

What does cross_val_predict return?

It uses KFold to split the data into k parts and then, for i = 1..k iterations:

  • takes the i'th part as the test data and all the other parts as the training data
  • trains the model with the training data (all parts except the i'th)
  • then, using this trained model, predicts the labels for the i'th part (the test data)

In each iteration, the labels of the i'th part of the data get predicted. In the end, cross_val_predict merges all the partial predictions and returns them as the final result.

This code shows this process step by step:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_predict

X = np.array([[0], [1], [2], [3], [4], [5]])
labels = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

cv = KFold(n_splits=3)
clf = SVC()
ypred_all = np.full(labels.shape, '', dtype=labels.dtype)  # holds the merged predictions
i = 1
for train_index, test_index in cv.split(X):
    print("iteration", i, ":")
    print("train indices:", train_index)
    print("train data:", X[train_index])
    print("test indices:", test_index)
    print("test data:", X[test_index])
    clf.fit(X[train_index], labels[train_index])
    ypred = clf.predict(X[test_index])
    print("predicted labels for data of indices", test_index, "are:", ypred)
    ypred_all[test_index] = ypred
    print("merged predicted labels:", ypred_all)
    i = i + 1
    print("=====================================")
y_cross_val_predict = cross_val_predict(clf, X, labels, cv=cv)
print("predicted labels by cross_val_predict:", y_cross_val_predict)

The result is:

iteration 1 :
train indices: [2 3 4 5]
train data: [[2] [3] [4] [5]]
test indices: [0 1]
test data: [[0] [1]]
predicted labels for data of indices [0 1] are: ['b' 'b']
merged predicted labels: ['b' 'b' '' '' '' '']
=====================================
iteration 2 :
train indices: [0 1 4 5]
train data: [[0] [1] [4] [5]]
test indices: [2 3]
test data: [[2] [3]]
predicted labels for data of indices [2 3] are: ['a' 'b']
merged predicted labels: ['b' 'b' 'a' 'b' '' '']
=====================================
iteration 3 :
train indices: [0 1 2 3]
train data: [[0] [1] [2] [3]]
test indices: [4 5]
test data: [[4] [5]]
predicted labels for data of indices [4 5] are: ['a' 'a']
merged predicted labels: ['b' 'b' 'a' 'b' 'a' 'a']
=====================================
predicted labels by cross_val_predict: ['b' 'b' 'a' 'b' 'a' 'a']

As you can see from the code of cross_val_predict on GitHub, the function computes the predictions for each fold and concatenates them. The predictions are made with a model learned from the other folds.

Here is a combination of your code and the example provided in that code:

from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import accuracy_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:400]
y = diabetes.target[:400]
cv = KFold(n_splits=20)
lasso = linear_model.Lasso()
y_pred = cross_val_predict(lasso, X, y, cv=cv)
# the diabetes targets are continuous, so both arrays are cast to int
# only to make accuracy_score applicable to this regression example
accuracy = accuracy_score(y_pred.astype(int), y.astype(int))

print(accuracy)
# >>> 0.0075

Finally, to answer your question: "No, the accuracy is not averaged for each fold"

As it is written in the documentation of sklearn.model_selection.cross_val_predict:

It is not appropriate to pass these predictions into an evaluation metric. Use cross_validate to measure generalization error.
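For completeness, here is a minimal sketch of what that recommendation looks like in practice (the iris data, estimator and fold settings below are only placeholders):

from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_validate

iris = datasets.load_iris()
cv_results = cross_validate(SVC(), iris.data, iris.target,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0),
                            scoring="accuracy")
print(cv_results["test_score"])         # one accuracy value per fold
print(cv_results["test_score"].mean())  # averaged estimate of generalization error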

I would like to add an option for a quick and easy answer, on top of what the previous developers contributed.

If you take the micro average of F1 you will essentially be getting the accuracy rate. So, for example, that would be:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support as score

# lm, df and y are your estimator, feature matrix and target vector
y_pred = cross_val_predict(lm, df, y, cv=5)
precision, recall, fscore, support = score(y, y_pred, average='micro')
print(fscore)

This works mathematically, since the micro average is computed from the pooled confusion matrix across all classes, and for single-label classification the micro-averaged F1 equals the accuracy.
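If you want to convince yourself of that equivalence, here is a quick check on the iris data with a placeholder classifier (since lm, df and y are not shown above); both printed values are the same:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, f1_score

iris = datasets.load_iris()
clf = LogisticRegression(max_iter=1000)
y_pred = cross_val_predict(clf, iris.data, iris.target, cv=5)

print(accuracy_score(iris.target, y_pred))
print(f1_score(iris.target, y_pred, average='micro'))  # identical to the accuracy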

Good luck.
