How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

Question

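I want to compare different binary classifiers in Python. To do so, I want to compute ROC AUC scores and measure the 95% confidence interval (CI) and the p-value to assess statistical significance.

Below is a minimal scikit-learn example that trains three different models on a binary classification dataset, plots the ROC curves, and computes the AUC scores.

Here are my specific questions:

How can I compute the 95% confidence interval (CI) of the ROC AUC score on the test set (e.g., using bootstrapping)?
How can I compare AUC scores (on the test set) and measure a p-value to assess statistical significance? (The null hypothesis is that the models do not differ. Rejecting the null hypothesis means that the difference in AUC scores is statistically significant.)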

import numpy as np

np.random.seed(2018)

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib
import matplotlib.pyplot as plt

data = load_breast_cancer()

X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=17)

# Naive Bayes Classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_prediction_proba = nb_clf.predict_proba(X_test)[:, 1]

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=20)
rf_clf.fit(X_train, y_train)
rf_prediction_proba = rf_clf.predict_proba(X_test)[:, 1]

# Multi-layer Perceptron Classifier
mlp_clf = MLPClassifier(alpha=1, hidden_layer_sizes=150)
mlp_clf.fit(X_train, y_train)
mlp_prediction_proba = mlp_clf.predict_proba(X_test)[:, 1]


def roc_curve_and_score(y_test, pred_proba):
    fpr, tpr, _ = roc_curve(y_test.ravel(), pred_proba.ravel())
    roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel())
    return fpr, tpr, roc_auc


plt.figure(figsize=(8, 6))
matplotlib.rcParams.update({'font.size': 14})
plt.grid()
fpr, tpr, roc_auc = roc_curve_and_score(y_test, rf_prediction_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, nb_prediction_proba)
plt.plot(fpr, tpr, color='green', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, mlp_prediction_proba)
plt.plot(fpr, tpr, color='crimson', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()

Answer 1

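Bootstrap for 95% confidence intervals

You want to repeat your analysis on many resamples of your data. In the general case, suppose you have a function f(x) that computes whatever statistic you need from the data x; you can then bootstrap like this: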

def bootstrap(x, f, nsamples=1000):
    stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
    return np.percentile(stats, (2.5, 97.5))

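This gives you the so-called plug-in estimate of the 95% confidence interval (that is, you simply take the percentiles of the bootstrap distribution).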
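As a quick illustration of the generic function, it can be applied to the mean of a sample. The data below is synthetic and made up purely for the example; the function body is repeated only so the snippet runs on its own.

```python
import numpy as np


def bootstrap(x, f, nsamples=1000):
    # Resample x with replacement and evaluate f on every resample
    stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
    return np.percentile(stats, (2.5, 97.5))


np.random.seed(0)
x = np.random.normal(loc=5.0, scale=2.0, size=500)  # synthetic data, illustration only
lower, upper = bootstrap(x, np.mean)
print("mean={:.3f}, 95% CI=[{:.3f}, {:.3f}]".format(x.mean(), lower, upper))
```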

In your case, you can write a more specific function like this:

def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for b in range(nsamples):
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        auc_values.append(roc_auc)
    return np.percentile(auc_values, (2.5, 97.5))

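Here, clf is the classifier whose performance you want to test, and X_train, y_train, X_test, y_test are as in your code.

This gave me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):

Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of the confidence interval)
Random Forest: 0.983 [0.974 0.989]
Multi-layer Perceptron: 0.974 [0.223 0.98]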
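Note that the function above resamples the training set and refits the classifier each time. Since the question asks about the interval on the test set, one alternative is to keep the fitted model fixed and resample the test-set predictions instead. The following is only a sketch under that assumption; bootstrap_auc_test_set is a hypothetical helper name and the demo data is synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_test_set(y_true, y_score, nsamples=1000, seed=0):
    """Percentile bootstrap CI for the AUC of fixed predictions (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    y_true = np.asarray(y_true).ravel()
    y_score = np.asarray(y_score).ravel()
    n = y_true.shape[0]
    auc_values = []
    while len(auc_values) < nsamples:
        idx = rng.randint(n, size=n)  # resample test observations with replacement
        # The AUC is undefined when a resample contains a single class; draw again
        if np.unique(y_true[idx]).size < 2:
            continue
        auc_values.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(auc_values, (2.5, 97.5))


# Synthetic labels and noisy scores, for illustration only
rng = np.random.RandomState(1)
y_demo = rng.randint(2, size=200)
score_demo = y_demo + rng.normal(scale=1.0, size=200)
lower, upper = bootstrap_auc_test_set(y_demo, score_demo, nsamples=500)
print("95% CI: [{:.3f}, {:.3f}]".format(lower, upper))
```

With the variables from the question, this would be called as bootstrap_auc_test_set(y_test, rf_prediction_proba). The resulting interval reflects test-set sampling variability only, not the variability from refitting the model.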

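Permutation test against chance performance

Strictly speaking, a permutation test would go through all permutations of your observation sequence and evaluate your ROC curve with the permuted target values (the features are not permuted). This is fine if you have only a few observations, but it becomes very expensive as the number of observations grows. It is therefore common to subsample the permutations and simply use a number of random permutations. Here, the implementation depends more on the specific thing you want to test. The following function does this for your roc_auc values: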

def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    idx1 = np.arange(X_train.shape[0])
    idx2 = np.arange(X_test.shape[0])
    auc_values = np.empty(nsamples)
    for b in range(nsamples):
        np.random.shuffle(idx1)  # Shuffles in-place
        np.random.shuffle(idx2)
        clf.fit(X_train, y_train[idx1])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
        auc_values[b] = roc_auc
    clf.fit(X_train, y_train)
    pred = clf.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
    return roc_auc, np.mean(auc_values >= roc_auc)

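This function again takes your classifier as clf and returns the AUC value on the unshuffled data together with the p-value (i.e., the probability of observing an AUC greater than or equal to the one on the unshuffled data).

Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that, because of the sampling, these are not exact, but they indicate that all of these classifiers perform better than chance.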
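A p-value of exactly 0 is an artifact of using a finite number of random permutations. A common convention, which is not part of the function above, is to count the observed statistic as one of the permutations, so the smallest reportable p-value becomes 1 / (nsamples + 1). A minimal sketch, with a hypothetical helper name and made-up null values:

```python
import numpy as np


def permutation_p_value(null_values, observed):
    """One-sided p-value with the add-one correction (hypothetical helper)."""
    null_values = np.asarray(null_values)
    # Count the observed statistic itself as one of the permutations
    return (np.sum(null_values >= observed) + 1) / (null_values.shape[0] + 1)


null_auc = np.array([0.48, 0.52, 0.50, 0.47, 0.55])  # made-up AUCs under shuffled labels
p = permutation_p_value(null_auc, observed=0.98)
print(p)
```

With 1000 permutations and no shuffled AUC reaching the observed one, this would report p = 1/1001 rather than 0.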

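Permutation test for differences between classifiers

This is much easier. Given two classifiers, you have a prediction from each of them for every observation. You just shuffle the assignment between predictions and classifiers like this: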

def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
    y_true = y_test.ravel()
    pred_1 = pred_proba_1.ravel()
    pred_2 = pred_proba_2.ravel()
    observed_difference = roc_auc_score(y_true, pred_1) - roc_auc_score(y_true, pred_2)
    # Store the differences in an array so the comparison below is element-wise
    auc_differences = np.empty(nsamples)
    for b in range(nsamples):
        # Randomly swap the two classifiers' predictions for each observation
        mask = np.random.randint(2, size=len(pred_1))
        p1 = np.where(mask, pred_1, pred_2)
        p2 = np.where(mask, pred_2, pred_1)
        auc_differences[b] = roc_auc_score(y_true, p1) - roc_auc_score(y_true, p2)
    return observed_difference, np.mean(auc_differences >= observed_difference)

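With this test and 1000 samples, I find no significant differences between the three classifiers:

Naive Bayes vs Random Forest: diff=0.0029, p(diff>)=0.311
Naive Bayes vs MLP: diff=0.0117, p(diff>)=0.186
Random Forest vs MLP: diff=0.0088, p(diff>)=0.203

Here diff denotes the difference in ROC AUC between the two classifiers, and p(diff>) is the empirical probability of observing a larger difference on the shuffled data.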
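The p(diff>) values above are one-sided. Because the null hypothesis in the question is simply that the models do not differ, a two-sided variant that compares absolute differences may be closer to what is wanted. The following is a sketch under that assumption; two_sided_permutation_test is a hypothetical name, the add-one correction is a convention rather than part of the function above, and the demo data is synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def two_sided_permutation_test(y_true, pred_1, pred_2, nsamples=1000, seed=0):
    """Paired permutation test on |AUC1 - AUC2| (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    y_true = np.asarray(y_true).ravel()
    pred_1 = np.asarray(pred_1).ravel()
    pred_2 = np.asarray(pred_2).ravel()
    observed = roc_auc_score(y_true, pred_1) - roc_auc_score(y_true, pred_2)
    diffs = np.empty(nsamples)
    for b in range(nsamples):
        # For each observation, randomly swap which classifier a prediction belongs to
        mask = rng.randint(2, size=y_true.shape[0]).astype(bool)
        p1 = np.where(mask, pred_1, pred_2)
        p2 = np.where(mask, pred_2, pred_1)
        diffs[b] = roc_auc_score(y_true, p1) - roc_auc_score(y_true, p2)
    # Two-sided: count permuted differences at least as extreme in either direction
    p_value = (np.sum(np.abs(diffs) >= abs(observed)) + 1) / (nsamples + 1)
    return observed, p_value


# Synthetic example: one informative score and one pure-noise score
rng = np.random.RandomState(2)
y_demo = rng.randint(2, size=200)
good = y_demo + rng.normal(scale=0.3, size=200)
noise = rng.normal(size=200)
diff, p = two_sided_permutation_test(y_demo, good, noise, nsamples=200)
print("diff={:.4f}, p={:.4f}".format(diff, p))
```

With the variables from the question, a call would look like two_sided_permutation_test(y_test, nb_prediction_proba, rf_prediction_proba).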

Answer 2

The code below can be used to compute the AUC and an asymptotically normal confidence interval for a neural network.

tf.contrib.metrics.auc_with_confidence_intervals(
    labels,
    predictions,
    weights=None,
    alpha=0.95,
    logit_transformation=True,
    metrics_collections=(),
    updates_collections=(),
    name=None)
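The tf.contrib module exists only in TensorFlow 1.x, so this call is not available in TensorFlow 2. For readers without it, an asymptotic normal interval can also be approximated directly from labels and scores. The sketch below is not the TensorFlow implementation: it swaps in the Hanley-McNeil (1982) standard-error approximation, hanley_mcneil_ci is a hypothetical name, and the demo data is synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def hanley_mcneil_ci(y_true, y_score, z=1.96):
    """AUC with an approximate normal CI via Hanley-McNeil (hypothetical helper).

    z=1.96 corresponds to a two-sided 95% interval.
    """
    y_true = np.asarray(y_true).ravel()
    auc = roc_auc_score(y_true, np.asarray(y_score).ravel())
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Hanley-McNeil approximations of the two pairwise ranking probabilities
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = np.sqrt(var)
    # Clip to the valid AUC range
    return auc, max(0.0, auc - z * se), min(1.0, auc + z * se)


# Synthetic labels and noisy scores, for illustration only
rng = np.random.RandomState(3)
y_demo = rng.randint(2, size=300)
score_demo = y_demo + rng.normal(scale=1.0, size=300)
auc, lower, upper = hanley_mcneil_ci(y_demo, score_demo)
print("AUC={:.3f}, 95% CI=[{:.3f}, {:.3f}]".format(auc, lower, upper))
```

With the variables from the question, a call would look like hanley_mcneil_ci(y_test, mlp_prediction_proba).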
