Using StratifiedKFold together with oversampling

Question

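I have a machine learning model and a dataset with 15 features about breast cancer. I want to predict a person's status (alive or dead). 85% of the cases are survivors and only 15% are deaths. So I want to use oversampling to handle this imbalance and combine it with stratified k-fold. I wrote this code and it seems to run fine, but I don't know whether I have the steps in the right order: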

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, random_state=None)
skf.get_n_splits(x, y)

ros = RandomOverSampler(sampling_strategy="not majority")
x_res, y_res = ros.fit_resample(x, y)

for train_index, test_index in skf.split(x_res, y_res):
    x_train, x_test = x_res.iloc[train_index], x_res.iloc[test_index]
    y_train, y_test = y_res.iloc[train_index], y_res.iloc[test_index]

Is it correct to do it this way? Or should I apply oversampling before the stratified k-fold?

Answer 1

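Note: resampling before splitting can cause data leakage, where information from the training data leaks into the test data (see the Common pitfalls section of the imblearn documentation).

Put the steps in a pipeline, then pass it to cross_validate together with StratifiedKFold: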

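from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# The sampler runs only when the pipeline is fitted, i.e. on each training fold.
model = make_pipeline(
    RandomOverSampler(sampling_strategy="not majority"),
    LogisticRegression(),
)

print(cross_validate(model, X, y, cv=StratifiedKFold())["test_score"].mean())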

Answer 2

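Is it correct to do it this way? Or should I apply oversampling before the stratified k-fold?

Note that this is exactly what your code does: you apply oversampling, ros.fit_resample(x, y), before the k-fold split, skf.split(x_res, y_res).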

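You should apply oversampling after the k-fold split. If you oversample before splitting, copies of the same data point can end up in both the training and the test set of the same split (this is called data leakage). The model is then evaluated on points it has already seen, so the scores look better than they really are and overfitting goes unnoticed.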
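To make the leakage concrete, here is a small toy sketch that uses only the standard library. Each row is represented by its original ID, so a leak is simply a test row whose ID also appears in the training rows. The helper names (oversample, folds, count_leaks), the 15/85 toy sizes and the seed are hypothetical, and the split is a plain shuffled k-fold rather than a stratified one, purely to keep the example short.

```python
import random

rng = random.Random(0)

# Toy data: IDs 0-14 are the minority class, IDs 15-99 the majority class.
minority = list(range(15))
majority = list(range(15, 100))


def oversample(minority_rows, target_size):
    """Random oversampling: add copies drawn with replacement up to target_size."""
    n_extra = target_size - len(minority_rows)
    return minority_rows + [rng.choice(minority_rows) for _ in range(n_extra)]


def folds(rows, n_splits):
    """Shuffle a copy of rows and yield (train, test) lists for each fold."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    for k in range(n_splits):
        test = shuffled[k::n_splits]
        train = [r for i, r in enumerate(shuffled) if i % n_splits != k]
        yield train, test


def count_leaks(train, test):
    """Count test rows whose original ID also occurs in the training rows."""
    train_ids = set(train)
    return sum(1 for row in test if row in train_ids)


# Wrong order: oversample the whole dataset, then split.
resampled = majority + oversample(minority, len(majority))
leaked_before = sum(count_leaks(tr, te) for tr, te in folds(resampled, 10))

# Right order: split first, then oversample the training part only.
leaked_after = 0
for train, test in folds(majority + minority, 10):
    train_minority = [r for r in train if r < 15]
    train_majority = [r for r in train if r >= 15]
    train_resampled = train_majority + oversample(train_minority, len(train_majority))
    leaked_after += count_leaks(train_resampled, test)

print("leaked test rows, oversample then split:", leaked_before)
print("leaked test rows, split then oversample:", leaked_after)
```

In the second variant the count is zero by construction: the test rows are original, unique IDs, and oversampling only ever copies rows that are already in the training part.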

A corrected version of your code looks like this:

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, random_state=None)
ros = RandomOverSampler(sampling_strategy="not majority")

for train_index, test_index in skf.split(x, y):
    x_train_unsampled, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train_unsampled, y_test = y.iloc[train_index], y.iloc[test_index]
    # Oversample the training fold only; the test fold stays untouched.
    x_train, y_train = ros.fit_resample(x_train_unsampled, y_train_unsampled)

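However, I encourage you to use a pipeline and cross_validate rather than writing all the boilerplate yourself, as Alexander suggests in his answer. That saves you time and effort and minimizes the risk of introducing bugs.

A few other notes: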

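get_n_splits() does nothing except return the number of splits you provided on the previous line. It doesn't actually do anything with the data. You can remove it from your code.
Note that I oversample only the training fold. The test fold should normally keep the original class distribution, so that the evaluation reflects the data the model will actually see.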
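Both notes can be checked quickly. The sketch below uses hypothetical labels (850 survivors, 150 deaths) and placeholder features, which are assumptions made for illustration. It shows that get_n_splits only reports the configured number of folds, and that the test folds of a StratifiedKFold keep the original class ratio as long as they are not resampled.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 850 survivors (0) and 150 deaths (1).
y = np.array([0] * 850 + [1] * 150)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=10)

# get_n_splits only returns the configured number of folds.
n_splits = skf.get_n_splits(X, y)
print(n_splits)

# Share of the minority class in each untouched test fold.
test_ratios = [y[test_index].mean() for _, test_index in skf.split(X, y)]
print(test_ratios)
```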


Answer 1
0 2023-01-01 02:59:04

Answer 2
0 2023-01-01 14:56:09
