ROC curve for Isolation Forest

Question

I am trying to plot the ROC curve to evaluate Isolation Forest on a Breast Cancer dataset. I calculated the True Positive Rate (TPR) and False Positive Rate (FPR) from the confusion matrix. However, I do not understand why the TPR and FPR come out as arrays instead of single scalar values. The ROC plot also seems to work only when FPR and TPR are arrays (I also tried writing the FPR and TPR calculation by hand).

Are the TPR and FPR values always arrays?

Either way, my ROC curve comes out as a straight line. Why is it so?

Confusion Matrix :

from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(y, y_pred_test1)

O/P :

>     [[  5  25]
>      [ 21 180]]

True Positive and False Positive : (Also, why are these values directly taken from the confusion matrix?)

import numpy as np

# Per-class counts: each class is treated in turn as the "positive" class
FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix)
FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
TP = np.diag(cnf_matrix)
TN = cnf_matrix.sum() - (FP + FN + TP)

FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)

O/P :

 False Positive [21. 25.] False Negative [25. 21.] True Positive [ 5. 180.] True Negative [180. 5.]

TPR and FPR :

tp_rate = TP/(TP+FN)
fp_rate = FP/(FP+TN)

O/P :

 TPR : [0.16666667 0.89552239] FPR [0.10447761 0.83333333]

ROC curve :

from sklearn import metrics
import matplotlib.pyplot as plt

plt.plot(fp_rate,tp_rate)
plt.show()

O/P : (plot not shown: a straight line joining the two points)

Answer 1

The confusion matrix essentially gives you a single point on the ROC curve. To construct a 'full' ROC curve you need a continuous score (or probability) for each instance; the curve is then traced out by varying the threshold that turns those scores into class predictions.
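On the question of why TPR and FPR came out as arrays: the per-class formulas in the question treat each class in turn as the positive one, so they return one value per class. For a binary problem with a fixed positive class, the four counts are plain scalars. A minimal sketch, assuming the second row/column of the matrix is the positive class and reusing the counts from the question's confusion matrix:

```python
# Confusion matrix from the question: rows are true labels, columns are predictions.
cm = [[5, 25],
      [21, 180]]

# Assumption: the second row/column is the positive class.
(tn, fp), (fn, tp) = cm

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate

print(f"TPR = {tpr:.4f}, FPR = {fpr:.4f}")
```

These two numbers are the second entries of the TPR and FPR arrays printed in the question; the first entries are the same rates with the other class taken as positive.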

In your simple case (when you have only one point of the ROC curve) you could plot the ROC curve by joining that point to the origin (0, 0) and to the point (1, 1):

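import matplotlib.pyplot as plt

# Counts read off your confusion matrix, taking the second class as positive.
TP = 180
FN = 21
FP = 25
TN = 5

tpr = TP / (TP + FN)
fpr = FP / (FP + TN)  # FPR uses the negative class; it is not 1 - TPR

# Join the single operating point to (0, 0) and (1, 1).
tpr_line = [0, tpr, 1]
fpr_line = [0, fpr, 1]

plt.plot(fpr_line, tpr_line, 'k-', lw=2)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()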

The resulting ROC curve consists of two straight segments joining (0, 0), the single (FPR, TPR) point, and (1, 1).

Answer 2

The confusion_matrix() function only tells you which points were classified correctly or incorrectly; it does not tell you how confident the model was in each prediction.

That confidence information is what an ROC curve is built from: the curve measures how well the model ranks the data points by their likelihood of belonging to a particular class.

Instead, use the decision_function() or score_samples() functions to calculate the model's confidence that each data point is (or is not) an anomaly. Then, use roc_curve() to get the points necessary to plot the curve itself.
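To make the thresholding idea concrete, here is a minimal pure-Python sketch of how a set of scores turns into several (FPR, TPR) points. The helper roc_points and the toy data are hypothetical and only for illustration; roc_curve() is the function to use in practice, and its output can differ in detail (for example, by default it drops some intermediate points).

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by thresholding at each distinct score.

    A sample is predicted positive when its score is >= the threshold.
    Assumes label 1 is the positive class and both classes are present.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y != 1)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical toy data: higher score = more likely to be the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

for fpr, tpr in roc_points(y_true, scores):
    print(fpr, tpr)
```

Each distinct threshold yields its own confusion matrix and therefore its own point, which is why hard predictions alone give only one point.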

Here is an example for the breast cancer dataset.

from sklearn.datasets import load_breast_cancer
X, y  = load_breast_cancer(return_X_y=True)

from sklearn.ensemble import IsolationForest

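# behaviour='new' was only needed on older scikit-learn releases and has since
# been removed, so it is omitted here.
clf = IsolationForest(max_samples=100,
                      random_state=0, contamination='auto')
clf.fit(X)
# Continuous scores rather than hard labels; lower means more abnormal.
y_pred = clf.score_samples(X)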

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y,y_pred)
import matplotlib.pyplot as plt
plt.plot(fpr, tpr, 'k-', lw=2)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
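One detail worth checking is the direction of the scores. In scikit-learn, score_samples() and decision_function() return lower values for more abnormal points, and in load_breast_cancer the label 1 means benign, so passing the raw scores treats the inlier class as positive. If the anomalies are the positive class instead, the scores need to be negated (and the positive label set accordingly). A minimal sketch with hypothetical toy scores and a hand-written pairwise AUC helper (auc here is illustrative, not a scikit-learn function):

```python
def auc(y_true, scores, pos_label=1):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == pos_label]
    neg = [s for y, s in zip(y_true, scores) if y != pos_label]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: label 1 = normal, label 0 = anomaly; lower score = more abnormal.
y_true = [1, 1, 1, 0, 0]
scores = [-0.40, -0.45, -0.60, -0.55, -0.70]

auc_normal = auc(y_true, scores, pos_label=1)                  # normal class as positive
auc_anomaly = auc(y_true, [-s for s in scores], pos_label=0)   # anomalies as positive, scores negated
auc_wrong = auc(y_true, scores, pos_label=0)                   # anomalies as positive, scores NOT negated

print(f"{auc_normal:.3f} {auc_anomaly:.3f} {auc_wrong:.3f}")
```

The first two values agree, while the third is their complement: a curve that falls below the diagonal is often a sign that the score direction and the positive label are mismatched.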

solution1
2 ACCEPTED 2019-03-11 16:34:09

solution2
2 2019-03-11 17:19:44
