
sklearn.metrics.precision_recall_curve: Why are precision and recall returned as arrays instead of single values?

I am calculating precision and recall for off-the-shelf algorithms on a dataset I recently prepared.

It is a binary classification problem, and I am looking to calculate the precision, recall, and f-score for each of the classifiers I built.

test_x, test_y, predics, pred_prob,score = CH.buildBinClassifier(data,allAttribs,0.3,50,'logistic')

The buildBinClassifier method basically builds a classifier, fits it to the training data, and returns test_x (the features of the test data), test_y (the ground-truth labels), predics (predictions made by the classifier), and pred_prob (prediction probabilities from the LogisticRegression.predict_proba method).

Below is the code for calculating precision-recall:

from sklearn.metrics import precision_recall_curve

pr, re, _ = precision_recall_curve(test_y,pred_prob,pos_label=1)
pr
array([ 0.49852507,  0.49704142,  0.49554896,  0.49702381,  0.49850746,
        0.5       ,  0.5015015 ,  0.50301205,  0.50453172,  0.50606061,
        ...
        0.875     ,  1.        ,  1.        ,  1.        ,  1.        ,
        1.        ,  1.        ,  1.        ,  1.        ])
re
array([ 1.        ,  0.99408284,  0.98816568,  0.98816568,  0.98816568,
        0.98816568,  0.98816568,  0.98816568,  0.98816568,  0.98816568,
        ...
        0.04142012,  0.04142012,  0.03550296,  0.0295858 ,  0.02366864,
        0.01775148,  0.01183432,  0.00591716,  0.        ])

I do not understand why precision and recall are arrays. Shouldn't they each be a single number, given that precision is calculated as tp/(tp+fp) and recall, by the analogous definition, as tp/(tp+fn)?
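
For reference, here is a minimal sketch, with made-up counts, showing how those definitions yield single numbers for one fixed set of predictions:

# Hypothetical confusion-matrix counts for one fixed set of predictions
tp, fp, fn = 80, 20, 30

precision = tp / (tp + fp)  # 80 / 100 = 0.8
recall = tp / (tp + fn)     # 80 / 110 ≈ 0.727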

I am aware of calculating the average precision and recall with the following piece of code, but somehow seeing arrays instead of tp, fp, precision, and recall values makes me wonder what is going on.

from sklearn.metrics import precision_recall_fscore_support as prf

precision,recall,fscore,_ = prf(test_y,predics,pos_label=1,average='binary')

Edit: But without the average and pos_label parameters it reports the precision for each of the classes. Could someone explain the difference between the outputs of these two methods?
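
For illustration, a small self-contained sketch (with toy labels, not the question's data) that contrasts the two output shapes:

from sklearn.metrics import precision_recall_fscore_support as prf

test_y  = [0, 0, 1, 1, 1]   # toy ground-truth labels
predics = [0, 1, 1, 1, 0]   # toy predictions

# average=None: one precision/recall/f-score value per class, in label order [0, 1]
print(prf(test_y, predics, average=None))
# average='binary': single values, computed for the positive class (pos_label) only
print(prf(test_y, predics, pos_label=1, average='binary'))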

In a binary classification problem, pred_prob is the probability of the instance belonging to each of the classes, so the predicted value (class) actually depends on this probability and on one more value, called the threshold. All instances with pred_prob greater than the threshold are classified into one class, and those with smaller values into the other. The default threshold is 0.5.
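
As a sketch, assuming a fitted binary LogisticRegression clf and a test feature matrix test_x (names taken from the question's setup), the default behaviour amounts to:

proba = clf.predict_proba(test_x)[:, 1]   # probability of the positive class
labels = (proba > 0.5).astype(int)        # matches clf.predict(test_x) for binary problems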

So, by varying the threshold we get different prediction results. In many problems a much better result can be obtained by adjusting the threshold. That is what precision_recall_curve gives you.
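
A sketch of that idea, assuming pred_prob is a NumPy array of positive-class probabilities as in the question:

from sklearn.metrics import precision_score, recall_score

for threshold in (0.3, 0.5, 0.7):                    # hypothetical cut-offs
    labels = (pred_prob >= threshold).astype(int)    # classify by the current threshold
    print(threshold,
          precision_score(test_y, labels),
          recall_score(test_y, labels))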

From the sklearn documentation for precision_recall_curve:

Compute precision-recall pairs for different probability thresholds.

Classifier models like logistic regression do not actually output class labels (like "0" or "1"); they output probabilities (like 0.67). These probabilities tell you the likelihood that the input sample belongs to a particular class, such as the positive ("1") class. But you still need to choose a probability threshold so that the algorithm can convert the probability (0.67) into a class ("1").

If you choose a threshold of 0.5, then all input samples with calculated probabilities greater than 0.5 will be assigned to the positive class. If you choose a different threshold, you get a different number of samples assigned to the positive and negative classes, and therefore different precision and recall scores.
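
The toy example from the sklearn documentation makes the array outputs concrete: each threshold produces one (precision, recall) pair.

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted positive-class probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision)   # [0.66666667 0.5        1.         1.        ]
print(recall)      # [1.  0.5 0.5 0. ]
print(thresholds)  # [0.35 0.4  0.8 ]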
