Why does sklearn's permutation_test_score return a different ROC AUC score than the one I compute manually with predict_proba and roc_auc_score?

Question

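When I compute the score manually with predict_proba and roc_auc_score, I cannot reproduce the ROC AUC score returned by permutation_test_score. This matters because it could be the difference between a significant and a non-significant result for a project.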

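A visual of the output:

[Image: histogram of the permutation scores with the three lines described below]

(Yellow) score from permutation_test_score = 0.5256
(Green) score from roc_auc_score using predict_proba = 0.5416
(Red) the 97.5th percentile line, representing the p = .05 significance cutoff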

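Here is the code that produced the visual, which I adapted from the sklearn example in the permutation_test_score documentation. The grid.best_estimator_ object is a RandomForestClassifier that resulted from a randomized grid search with exactly the same cross-validation you see below; I can include that code if it helps. Also, if a standalone reproducible example taken directly from that sklearn example would help, I can provide that as well (omitted here for space):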

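import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, permutation_test_score

rskf = StratifiedKFold(n_splits=5)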

n_permutations = 300
###### Use best estimator and run it on the Validation set. Validation set targets are permuted n times.
score_ofc, perm_scores_ofc, pvalue_ofc = permutation_test_score(
    grid.best_estimator_, 
    X_val, 
    y_val, 
    scoring="roc_auc", 
    cv=rskf, 
    n_permutations=n_permutations, 
    n_jobs=6, 
    random_state=42,
    verbose=1
)

###### manual calculation of roc_auc score
y_pred_val = grid.best_estimator_.predict_proba(X_val)[:,1]
roc_auc_val = roc_auc_score(y_val, y_pred_val)
p_val_man = (np.sum(perm_scores_ofc >= roc_auc_val) + 1.0) / (n_permutations + 1)


##### Plot permutations
# Create one figure of the desired size; a separate plt.figure() call would open a second, empty figure
fig, ax = plt.subplots(figsize=(5, 5))

ax.hist(perm_scores_ofc, bins=20, density=True)
###### Compare roc_auc_val score to score_ofc 
ax.axvline(roc_auc_val, ls="--", color="g", lw=3)
ax.axvline(score_ofc, ls="--", color="y", lw=3)
###### Include line showing the p=.05 significance level
ax.axvline(np.percentile(perm_scores_ofc, 97.5), ls="-", color="r", lw=3)
ax.set_xlabel("ROC AUC score")
_ = ax.set_ylabel("Probability")     # copy-pasted all this code from the sklearn documentation, and I'm not sure why they called this "probability"

print('''
Green = Score on original data using "manual" predict_proba method
      = {}
p-val = {}

Yellow = Score on original data using "automatic" permutation_test_score method
      = {}
p-val = {}

97.5 Percentile value: {}
'''.format(roc_auc_val, p_val_man, score_ofc, pvalue_ofc, np.percentile(perm_scores_ofc, 97.5)))

plt.show()
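The manual p-value in the code above follows the formula (C + 1) / (n_permutations + 1), where C counts the permutation scores that are at least as large as the observed score. A minimal sketch of that calculation, using a hypothetical helper name (`permutation_p_value`) and made-up toy scores rather than the values from the question:

```python
import numpy as np


def permutation_p_value(observed_score, perm_scores):
    """One-sided permutation p-value: (C + 1) / (n_permutations + 1)."""
    perm_scores = np.asarray(perm_scores, dtype=float)
    # C = number of permuted-label scores at least as good as the observed one
    n_at_least_as_good = np.sum(perm_scores >= observed_score)
    return (n_at_least_as_good + 1.0) / (perm_scores.size + 1)


# Toy, made-up permutation scores (not data from the question)
toy_perm_scores = [0.48, 0.50, 0.51, 0.53, 0.55]

# Only 0.55 is >= 0.54, so C = 1 and the p-value is (1 + 1) / (5 + 1)
p_value = permutation_p_value(0.54, toy_perm_scores)
print(p_value)
```

Because the count uses `>=`, the smallest p-value attainable with n permutations is 1 / (n + 1), so a higher observed score can only keep the p-value the same or lower it.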

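I have seen one or two other related questions (for example: here) about the difference between scorers that use decision_function and those that use predict_proba, but that should not be the issue here, because RandomForestClassifier has no decision_function attribute. So permutation_test_score must be using predict_proba, right? Then why do I get different results?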
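The reasoning about predict_proba can be checked directly. The sketch below uses a synthetic dataset and a small forest (both assumptions made for illustration, not the data or model from the question) and compares the built-in "roc_auc" scorer with the manual predict_proba calculation on the same fitted model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import get_scorer, roc_auc_score

# Synthetic stand-in data; the question's X_val / y_val are not available here
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# RandomForestClassifier exposes predict_proba but no decision_function
has_decision_function = hasattr(clf, "decision_function")

# Score the same fitted model on the same data in both ways
scorer_auc = get_scorer("roc_auc")(clf, X, y)
manual_auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])

print(has_decision_function, scorer_auc, manual_auc)
```

When both calculations use the same fitted model and the same data, they agree, which suggests the discrepancy in the question comes from which model is being scored rather than from the scorer.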

Thanks for any help! I have been trying to figure this out for days.

Edit

For completeness, here are my original pipeline and grid search code.

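from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, RobustScaler

pca = PCA()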
pipe = Pipeline(
    [
        ('scaler', MaxAbsScaler()),
        ('pca', pca),
        ('classifier', RandomForestClassifier()),
    ]
)

param_grid = [
    {
        'classifier': [RandomForestClassifier(random_state=42, n_jobs=-1)],
        'classifier__max_depth' : [i for i in range(1, 8, 2)],
        'scaler': [RobustScaler()],
        'pca__n_components': [33],
        'classifier__n_estimators' : [250],
        'classifier__criterion' : ['gini'],
        'classifier__max_features' : [0.3],
        'classifier__min_samples_split': [12],
        'classifier__min_samples_leaf': [9]
    }
]
###################################################
### USE GRID SEARCH TO FIND BEST HYPERPARAMETERS ###
# SCORING = ROC AUC

rskf = StratifiedKFold(n_splits=5)

grid = RandomizedSearchCV(pipe, 
                          param_grid,
                          n_iter=60,                           # Seemed like the right balance between computation time and exhaustiveness
                          random_state=42,
                          scoring='roc_auc',
                          cv=rskf,
                          refit=True,
                          return_train_score=True,
                          verbose=1,
                          n_jobs=6
                         ).fit(X_train, y_train)               # Fit all iterations on training data

Answer 1

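The score returned by permutation_test_score is obtained by (re)fitting the estimator (source), so if you have not set a random state in the random forest, the resulting model can differ from the one in grid.best_estimator_. The refit also happens on the data you pass in: a clone of the estimator is cross-validated on X_val and y_val with the given cv, so the reported score is an average over folds of models trained on parts of the validation set, not the score of the model already fitted on X_train.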
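The refitting behaviour can be illustrated with a small sketch. It uses synthetic data and a plain RandomForestClassifier as stand-ins (assumptions for illustration; the question's pipeline and data are not reproduced) and compares three numbers: the score from permutation_test_score, the mean of cross_val_score on the same data with the same cv, and the manual AUC of the model fitted on the training split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    permutation_test_score,
    train_test_split,
)

# Synthetic stand-in for the question's train / validation split
X, y = make_classification(n_samples=400, n_features=10, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Model fitted on the training split, playing the role of grid.best_estimator_
model = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_train, y_train)
cv = StratifiedKFold(n_splits=5)

# permutation_test_score clones the estimator and cross-validates it on X_val / y_val
perm_score, perm_scores, p_value = permutation_test_score(
    model, X_val, y_val, scoring="roc_auc", cv=cv, n_permutations=20, random_state=0
)

# Same cross-validation done explicitly: clones refitted on folds of the validation set
cv_mean = cross_val_score(model, X_val, y_val, scoring="roc_auc", cv=cv).mean()

# Manual calculation from the question: the train-fitted model scored on all of X_val
prefit_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(perm_score, cv_mean, prefit_auc)
```

With a fixed random_state, the first two numbers match, while the third generally differs because it comes from a different fitted model. Comparing the manual AUC against perm_scores therefore mixes two evaluation procedures, since the permutation scores were also produced by refitting within the validation set.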

Solution 1 | 0 | 2022-07-15 14:50:04