cross_val_predict probabilities using RepeatedStratifiedKFold 5*10

Question

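My goal is to calculate AUC, specificity, sensitivity, and a 95% CI from a 5*10 StratifiedKFold CV. I also need specificity and sensitivity at a threshold of 0.4, in order to maximize sensitivity.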

So far I have been able to implement this for AUC. The code is below:

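import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)

seed = 42

# Grid Search
fit_intercept = [True, False]
C = np.arange(1, 41, 1)  # a flat range of values, not a list wrapping one array
penalty = ['l1', 'l2']

params = dict(C=C, fit_intercept=fit_intercept, penalty=penalty)
print(params)

# liblinear supports both the l1 and l2 penalties
logreg = LogisticRegression(random_state=seed, solver='liblinear')
# instantiate the grid; iid is dropped because the string 'False' is truthy
# and newer scikit-learn versions no longer accept the argument
logreg_grid = GridSearchCV(logreg, param_grid=params, cv=5, scoring='roc_auc')
# fit the grid with data
logreg_grid.fit(X_train, y_train)

logreg = logreg_grid.best_estimator_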

cv = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 10, random_state = seed)


logreg_scores = cross_val_score(logreg, X_train, y_train, cv=cv, scoring='roc_auc')
print('LogReg:',logreg_scores.mean())


import scipy.stats
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2, n-1)
    return m, m-h, m+h

mean_confidence_interval(logreg_scores, confidence=0.95)

Output: (0.7964761904761904, 0.7675441789148183, 0.8254082020375626)

I am happy with this so far, but how can I get the predicted probabilities so that I can compute FPR, TPR, and thresholds? For a simple 5-fold CV, I would do this:

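from sklearn import metrics
from sklearn.model_selection import cross_val_predict

def evaluate_threshold(threshold):
    # report the last ROC point whose threshold is still above the cutoff
    print('Sensitivity(',threshold,'):', tpr[thresholds > threshold][-1])
    print('Specificity(',threshold,'):', 1 - fpr[thresholds > threshold][-1])

# out-of-fold probabilities; column 1 is the positive class
logreg_proba = cross_val_predict(logreg, X_train, y_train, cv=5, method='predict_proba')
fpr, tpr, thresholds = metrics.roc_curve(y_train, logreg_proba[:,1])
evaluate_threshold(0.5)
evaluate_threshold(0.4)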

#Output would be: 
#Sensitivity( 0.5 ): 0.76
#Specificity( 0.5 ): 0.7096774193548387
#Sensitivity( 0.4 ): 0.88
#Specificity( 0.4 ): 0.6129032258064516
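As a cross-check on `evaluate_threshold`, sensitivity and specificity at a fixed cutoff can also be computed straight from the probabilities, without indexing into the `roc_curve` arrays. The sketch below uses made-up numbers; the helper name `sens_spec_at` and the toy arrays are assumptions, not part of the original code. It treats `proba >= threshold` as positive, whereas `evaluate_threshold` picks ROC points with `thresholds > threshold`, and `roc_curve` may drop intermediate thresholds by default, so the two approaches can disagree slightly.

```python
import numpy as np

def sens_spec_at(y_true, proba, threshold):
    """Return (sensitivity, specificity) calling proba >= threshold positive."""
    y_true = np.asarray(y_true)
    pred = np.asarray(proba) >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    tn = np.sum(~pred & (y_true == 0))
    fp = np.sum(pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels and positive-class probabilities, for illustration only
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_toy = np.array([0.10, 0.30, 0.45, 0.60, 0.35, 0.45, 0.70, 0.90])

print(sens_spec_at(y_toy, p_toy, 0.5))  # sensitivity 0.5, specificity 0.75
print(sens_spec_at(y_toy, p_toy, 0.4))  # sensitivity 0.75, specificity 0.5
```

With real data, the call would presumably take `y_train` and the second column of the `predict_proba` output in place of the toy arrays.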

If I try it with the 5*10 CV like this:

cv = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 10, random_state = seed)
# the next call is the one that raises the error
logreg_proba = cross_val_predict(logreg, X_train, y_train, cv=cv, method='predict_proba')
fpr, tpr, thresholds = metrics.roc_curve(y_train, logreg_proba[:,1])
evaluate_threshold(0.5)
evaluate_threshold(0.4)

It throws an error:

cross_val_predict only works for partitions

Can you help me solve this?

Answer 1

This is what I tried.

for i in range(10):
    cv = StratifiedKFold(n_splits = 5, random_state = i)   
    y_pred = cross_val_predict(logreg, X_train, y_train, cv=cv, method='predict_proba')
    fpr, tpr, thresholds = metrics.roc_curve(y_train, y_pred[:,1])
    evaluate_threshold(0.5)

Out: 
Sensitivity( 0.5 ): 0.84
Specificity( 0.5 ): 0.6451612903225806
Sensitivity( 0.5 ): 0.84
Specificity( 0.5 ): 0.6451612903225806
Sensitivity( 0.5 ): 0.84
Specificity( 0.5 ): 0.6451612903225806
and so on....

Unfortunately, the output is always the same, which is not what I would expect when using RepeatedStratifiedKFold.

Maybe someone can give me a suggestion?
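One likely reason the output never changes: without `shuffle=True`, `StratifiedKFold` builds the same folds whatever `random_state` is, and newer scikit-learn releases reject that combination outright. A minimal sketch of the loop with shuffling enabled, run on synthetic data. The names `X_demo`, `y_demo`, `model`, `sens_spec`, and `mean_ci` are assumptions for illustration, and the cutoff rule `proba >= threshold` is a choice made here, not taken from the original code.

```python
import numpy as np
import scipy.stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-ins for X_train / y_train and the tuned model (assumptions)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

def sens_spec(y_true, proba, threshold):
    """Sensitivity and specificity when proba >= threshold is called positive."""
    pred = proba >= threshold
    pos, neg = (y_true == 1), (y_true == 0)
    return (pred & pos).sum() / pos.sum(), (~pred & neg).sum() / neg.sum()

def mean_ci(values, confidence=0.95):
    """Mean and t-based confidence interval, as in the question."""
    a = np.asarray(values, dtype=float)
    m, se = a.mean(), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2, len(a) - 1)
    return m, m - h, m + h

results = {0.5: [], 0.4: []}
for i in range(10):
    # shuffle=True makes random_state matter, so each repeat gets new folds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    proba = cross_val_predict(model, X_demo, y_demo, cv=cv,
                              method='predict_proba')[:, 1]
    for t in results:
        results[t].append(sens_spec(y_demo, proba, t))

for t, pairs in results.items():
    sens, spec = zip(*pairs)
    print('Sensitivity(', t, '):', mean_ci(sens))
    print('Specificity(', t, '):', mean_ci(spec))
```

Note that the interval here is taken over the 10 repeats, each pooled across its 5 folds, rather than over 50 per-fold scores as in the AUC calculation.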

Solution 1
0 2020-05-19 07:07:51