
AUC calculation in decision tree in scikit-learn

Using scikit-learn with Python 2.7 on Windows, what is wrong with my code to calculate AUC? Thanks.

from sklearn.datasets import load_iris
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
#print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="precision")
#print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="recall")
print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="roc_auc")

Traceback (most recent call last):
  File "C:/Users/foo/PycharmProjects/CodeExercise/decisionTree.py", line 8, in <module>
    print cross_val_score(clf, iris.data, iris.target, cv=10, scoring="roc_auc")
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1433, in cross_val_score
    for train, test in cv)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
    self.results = batch()
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1550, in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1606, in _score
    score = scorer(estimator, X_test, y_test)
  File "C:\Python27\lib\site-packages\sklearn\metrics\scorer.py", line 159, in __call__
    raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported

Edit 1: it looks like scikit-learn can even decide the threshold without any machine learning model, and I am wondering why:

import numpy as np
from sklearn.metrics import roc_curve
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
print fpr
print tpr
print thresholds

The roc_auc scorer in sklearn works only with binary classification:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html

One way to work around this issue is to binarize your labels and extend your classification to a one-vs-all scheme. In sklearn you can use sklearn.preprocessing.LabelBinarizer. The documentation is here:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
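As a sketch of that workaround on the iris data (using a plain train/test split instead of cross-validation for brevity; the split size and random_state here are illustrative, and this uses the modern sklearn.model_selection API rather than the deprecated cross_validation module):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0,
    stratify=iris.target)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

# Binarize the multiclass labels: one indicator column per class.
lb = LabelBinarizer()
y_test_bin = lb.fit_transform(y_test)

# One-vs-all: compute a binary AUC for each class separately.
for i, cls in enumerate(lb.classes_):
    auc = roc_auc_score(y_test_bin[:, i], y_prob[:, i])
    print("class %d AUC: %.3f" % (cls, auc))
```

Each iteration scores one class against the rest, which reduces the multiclass problem to the binary setting that roc_auc supports; averaging the per-class AUCs gives a single summary number if you need one.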

Regarding the second part of your question posted under 'Edit 1':

  1. The roc_curve function does not find an optimum threshold for prediction.
  2. roc_curve generates a set of tpr and fpr values by varying the threshold over the range of scores, given y_true and y_prob (the probability of the positive class).
  3. In general, if the roc_auc value is high, your classifier is good. But you still need to find the optimum threshold that maximizes a metric such as F1 score when using the classifier for prediction.
  4. On an ROC curve, the optimum threshold corresponds to the point on the curve at maximum distance from the diagonal (the fpr = tpr line).
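Point 4 can be sketched directly on the toy example from 'Edit 1': the vertical distance of each ROC point from the diagonal is tpr - fpr (Youden's J statistic), so taking the argmax picks the threshold described above. This is one common choice, not the only one:

```python
import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)

# Distance from the diagonal (fpr = tpr line) at each candidate threshold.
j = tpr - fpr
best = np.argmax(j)
print("best threshold:", thresholds[best])  # -> 0.8
```

With these four samples the curve has two equally good points; np.argmax returns the first, which corresponds to the stricter threshold 0.8.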
