Scikit: calculate precision and recall using cross_val_score function
I'm using scikit to perform a logistic regression on spam/ham data. X_train is my training data and y_train the labels ('spam' or 'ham'), and I trained my LogisticRegression this way:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
If I want to get the accuracies of a 10-fold cross validation, I just write:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
I thought it was possible to also calculate the precisions and recalls by simply adding one parameter, this way:
precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')
But it results in a ValueError:
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')
Is it related to the data (should I binarize the labels?) or did they change the cross_val_score function?
Thank you in advance!
To compute the recall and precision, the data indeed has to be binarized, this way:
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit(y_train)
y_train = lb.transform(y_train).ravel()  # 'ham'/'spam' -> 0/1, flattened to 1-d
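To make the whole flow concrete, here is a minimal sketch on toy data (the random X_train and alternating y_train below are stand-ins for your own spam/ham features and labels):

```python
# Sketch: binarize string labels, then cross-validate a precision score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelBinarizer

rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)                   # toy features
y_train = np.array(['spam', 'ham'] * 50)     # toy string labels

lb = LabelBinarizer()
# fit_transform returns an (n, 1) column for binary labels; ravel() flattens it
y_bin = lb.fit_transform(y_train).ravel()

precision = cross_val_score(LogisticRegression(), X_train, y_bin,
                            cv=10, scoring='precision')
print(precision.mean())
```

With the labels mapped to 0/1, the default pos_label=1 of the precision scorer is valid and the ValueError goes away.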
To go further, I was surprised that I didn't have to binarize the data when I wanted to calculate the accuracy:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
It's just because the accuracy formula doesn't need any information about which class is considered positive or negative: (TP + TN) / (TP + TN + FN + FP). We can indeed see that TP and TN are exchangeable; that's not the case for recall, precision and f1.
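A tiny numeric check of that symmetry argument: swapping which class counts as positive leaves accuracy unchanged, but changes precision.

```python
# Accuracy is invariant to the choice of positive class; precision is not.
from sklearn.metrics import accuracy_score, precision_score

y_true = ['spam', 'spam', 'ham', 'ham', 'ham']
y_pred = ['spam', 'ham', 'ham', 'ham', 'spam']

print(accuracy_score(y_true, y_pred))                     # 3/5 = 0.6 either way
print(precision_score(y_true, y_pred, pos_label='spam'))  # TP=1, FP=1 -> 0.5
print(precision_score(y_true, y_pred, pos_label='ham'))   # TP=2, FP=1 -> 0.666...
```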
I encountered the same problem here, and I solved it with:
# precision, recall and F1
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y_train = np.array([number[0] for number in lb.fit_transform(y_train)])
recall = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recall), recall)
precision = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precision), precision)
f1 = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1', np.mean(f1), f1)
The syntax you showed above is correct. It looks like a problem with the data you're using. The labels don't need to be binarized, as long as they're not continuous numbers.
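Note that in recent scikit-learn versions the binary scorers default to pos_label=1, so string labels like 'spam'/'ham' do raise the error the question shows. One alternative to binarizing, sketched below on toy stand-in data, is to build a scorer that names the positive class explicitly via make_scorer:

```python
# Sketch: score string labels directly by telling the scorer which
# class is the positive one, instead of binarizing the labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_train = rng.rand(60, 4)                  # toy features
y_train = np.array(['spam', 'ham'] * 30)   # toy string labels

prec = cross_val_score(LogisticRegression(), X_train, y_train, cv=5,
                       scoring=make_scorer(precision_score, pos_label='spam'))
rec = cross_val_score(LogisticRegression(), X_train, y_train, cv=5,
                      scoring=make_scorer(recall_score, pos_label='spam'))
print(prec.mean(), rec.mean())
```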
You can prove out the same syntax with a different dataset:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X_train = iris['data']
y_train = iris['target']
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# iris is multiclass, so current scikit-learn needs an averaged scorer
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision_macro'))
print(cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall_macro'))
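If you want several metrics from the same folds, a single call to cross_validate (available in sklearn.model_selection since scikit-learn 0.19, as far as I know) avoids running the cross validation once per metric:

```python
# Sketch: precision and recall from one cross-validation run on iris.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

iris = load_iris()
clf = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
scores = cross_validate(clf, iris['data'], iris['target'], cv=10,
                        scoring=['precision_macro', 'recall_macro'])
print(scores['test_precision_macro'].mean())
print(scores['test_recall_macro'].mean())
```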
You could use cross-validation like this to get the f1-score and recall:
from time import time
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()  # X, y are your features and labels
print('10-fold cross validation:\n')
start_time = time()
scores = cross_val_score(clf, X, y, cv=10, scoring='f1')
recall_scores = cross_val_score(clf, X, y, cv=10, scoring='recall')
print("f1: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'DecisionTreeClassifier'))
print("recall: %0.2f (+/- %0.2f)" % (recall_scores.mean(), recall_scores.std()))
print("---Classifier %s used %s seconds ---" % ('DecisionTreeClassifier', time() - start_time))
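As a side note, cross_validate also records per-fold fit and score times, so the manual time() bookkeeping above can be read off the result instead. A minimal sketch on toy data (the random X and y are placeholders for your own):

```python
# Sketch: metrics and timing from one cross_validate call.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)              # toy features
y = rng.randint(0, 2, 200)        # toy 0/1 labels

res = cross_validate(DecisionTreeClassifier(), X, y, cv=10,
                     scoring=['f1', 'recall'])
print('f1:     %0.2f' % res['test_f1'].mean())
print('recall: %0.2f' % res['test_recall'].mean())
print('fit:    %0.3f s total' % res['fit_time'].sum())
```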