
Plotting Precision-Recall curve when using cross-validation in scikit-learn

I'm using cross-validation to evaluate the performance of a classifier with scikit-learn, and I want to plot the Precision-Recall curve. I found an example on scikit-learn's website for plotting the PR curve, but it doesn't use cross-validation for the evaluation.

How can I plot the Precision-Recall curve in scikit-learn when using cross-validation?

I did the following, but I'm not sure if it's the correct way to do it (pseudocode):

for each k-fold:
    precision, recall, _ = precision_recall_curve(y_test, probs)
    mean_precision += precision
    mean_recall += recall

mean_precision /= num_folds
mean_recall /= num_folds

plt.plot(mean_recall, mean_precision)

What do you think?

Edit:

This doesn't work, because the sizes of the precision and recall arrays are different after each fold.

Anyone?

Instead of recording the precision and recall values after each fold, store the predictions on the test samples after each fold. Next, collect all the test (i.e. out-of-bag) predictions and compute precision and recall.

 ## let test_samples[k] = test samples for the kth fold (list of lists)
 ## let train_samples[k] = train samples for the kth fold (list of lists)

 for k in range(num_folds):
      model = train(parameters, train_samples[k])
      predictions_fold[k] = predict(model, test_samples[k])

 # collect predictions
 predictions_combined = [p for preds in predictions_fold for p in preds]

 ## let predictions = rearranged predictions s.t. they are in the original order

 ## use predictions and labels to compute lists of TP, FP, FN
 ## use TP, FP, FN to compute precisions and recalls for one run of k-fold cross-validation
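A runnable sketch of the steps above, assuming scikit-learn is available. The dataset and the `LogisticRegression` model are placeholders for illustration; the point is that each sample gets exactly one out-of-fold probability, and a single PR curve is computed from the pooled predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)

# Store one out-of-fold probability per sample across all folds,
# indexed so they stay in the original sample order.
oof_probs = np.zeros(len(y))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof_probs[test_idx] = model.predict_proba(X[test_idx])[:, 1]

# A single PR curve computed from the pooled out-of-fold predictions.
precision, recall, _ = precision_recall_curve(y, oof_probs)
pr_auc = auc(recall, precision)
```

This sidesteps the array-size problem from the question: there is only one set of predictions, so only one PR curve is computed.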

Under a single, complete run of k-fold cross-validation, the predictor makes one and only one prediction for each sample. Given n samples, you should have n test predictions.

(Note: These predictions are different from training predictions, because the predictor makes the prediction for each sample without having previously seen it.)

Unless you are using leave-one-out cross-validation, k-fold cross-validation generally requires a random partitioning of the data. Ideally, you would do repeated (and stratified) k-fold cross-validation. Combining precision-recall curves from different rounds, however, is not straightforward, since you cannot use simple linear interpolation between precision-recall points, unlike ROC (see Davis and Goadrich 2006).

I personally calculated AUC-PR using the Davis-Goadrich method for interpolation in PR space (followed by numerical integration), and compared the classifiers using the AUC-PR estimates from repeated stratified 10-fold cross-validation.

For a nice plot, I showed a representative PR curve from one of the cross-validation rounds.

There are, of course, many other ways of assessing classifier performance, depending on the nature of your dataset.

For instance, if the proportion of (binary) labels in your dataset is not skewed (i.e. it is roughly 50-50), you could use the simpler ROC analysis with cross-validation:

Collect predictions from each fold and construct ROC curves (as before), collect all the TPR-FPR points (i.e. take the union of all TPR-FPR tuples), then plot the combined set of points with possible smoothing. Optionally, compute AUC-ROC using simple linear interpolation and the composite trapezoid method for numerical integration.
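A minimal sketch of this ROC variant, again assuming scikit-learn (the dataset and model are illustrative placeholders). `cross_val_predict` handles the fold bookkeeping and returns one out-of-fold prediction per sample, and `sklearn.metrics.auc` integrates the TPR-FPR points with the composite trapezoid rule:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

# Pool the out-of-fold probabilities from every fold into one
# prediction vector, aligned with the original sample order.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]

# TPR-FPR points from the pooled predictions, then AUC-ROC via
# trapezoidal numerical integration.
fpr, tpr, _ = roc_curve(y, probs)
auc_roc = auc(fpr, tpr)
```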

This is currently the best way to plot a Precision-Recall curve for an sklearn classifier using cross-validation. The best part is that it plots the PR curves for ALL classes, so you get multiple neat-looking curves as well:

from sklearn.linear_model import LogisticRegression
from scikitplot.classifiers import plot_precision_recall_curve
import matplotlib.pyplot as plt

clf = LogisticRegression()
plot_precision_recall_curve(clf, X, y)
plt.show()

The function automatically takes care of cross-validating the given dataset, concatenating all out-of-fold predictions, and calculating the PR curves for each class plus an averaged PR curve. It's a one-line function that takes care of it all for you.

[Image: Precision Recall Curves]

Disclaimer: Note that this uses the scikit-plot library, which I built.
