
StratifiedKFold with imbalanced data for multiclass classification

I'm trying to build a multiclass classification model using imbalanced data with few samples (436) and 3 classes. After standardizing the data, I split it using StratifiedKFold to make sure my minority class is well represented in both the train and test splits:

from sklearn.model_selection import StratifiedKFold

sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)  # tried 5/10/15/20 splits
for train_index, test_index in sss.split(X, y):
    # print("Train:", train_index, "Test:", test_index)
    original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
    original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]
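As a quick sanity check (a minimal sketch, assuming X and y are pandas objects as in the code above), the class proportions in each test fold should closely match those of the full label vector:

# With stratification, the test-fold class proportions should
# closely mirror those of the whole dataset.
print(original_ytest.value_counts(normalize=True))
print(y.value_counts(normalize=True))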

I've read that feature selection and oversampling should be applied only to the training set, and that's exactly what I did.

from imblearn.over_sampling import SMOTE

# I did feature selection before this step
smote = SMOTE(sampling_strategy='not majority')  # oversample every class except the majority one
X_sm, y_sm = smote.fit_resample(original_Xtrain, original_ytrain)
print(X_sm.shape, y_sm.shape)
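For example, comparing the class counts before and after resampling (a small sketch using collections.Counter, not part of the original code) should show every minority class brought up to the majority class size:

from collections import Counter

# 'not majority' oversamples every class except the majority one,
# so all three classes should end up with equal counts.
print("before:", Counter(original_ytrain))
print("after: ", Counter(y_sm))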

Then I trained my model using the training set oversampled by SMOTE. At this point I want to use cross-validation, but I don't know whether I should reuse the stratified splitter from before, set the cv parameter to a new value like 5 splits, or use ShuffleSplit instead:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score, roc_auc_score

for key, classifier in classifiers.items():
    classifier.fit(X_sm, y_sm)
    training_score1 = cross_val_score(classifier, X_sm, y_sm,
        scoring=make_scorer(f1_score, average='macro'), error_score="raise", cv=5)
    print("Classifier:", classifier.__class__.__name__, "has a training score of", round(training_score1.mean(), 2) * 100, "% F1 score")
    training_score2 = cross_val_score(classifier, X_sm, y_sm,
        scoring=make_scorer(roc_auc_score, average='macro', multi_class='ovo', needs_proba=True),
        error_score="raise", cv=5)
    print("Classifier:", classifier.__class__.__name__, "has a training score of", round(training_score2.mean(), 2) * 100, "% ROC AUC score")

Well, I tried both, and using a new cv parameter gives me the best results. I think that in the case of imbalanced data, StratifiedKFold replaces train_test_split, so there is no need to use it again in the cross-validation, but I'm not sure about this. Can you please tell me if I missed something in my process? Am I doing it wrong?

This is your cross-validation strategy:

sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)  # tried 5/10/15/20 splits

You don't need another one inside the loop.
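For instance, you can pass the splitter you already built directly as the cv argument (a sketch reusing the names from the question; if you still want SMOTE, it has to be re-fit inside each training fold, as in the pipeline sketch further below):

# Reuse the stratified splitter instead of creating a second CV scheme;
# scoring then happens on original, non-oversampled folds.
training_score = cross_val_score(classifier, X, y,
    scoring=make_scorer(f1_score, average='macro'), cv=sss)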


using a new cv parameter gives me the best results.

It does, because you are using oversampled data as test data inside this call:

training_score1 = cross_val_score(classifier, X_sm, y_sm,
    scoring=make_scorer(f1_score, average='macro'), error_score="raise", cv=5)

But you already know you shouldn't do that:

oversampling should be applied only to the training set
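One common way to respect that inside cross-validation is imbalanced-learn's Pipeline, which re-fits SMOTE on each training fold and leaves the held-out fold untouched (a sketch; the LogisticRegression classifier is just an illustrative stand-in):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, f1_score

# SMOTE runs only when the pipeline is fit, i.e. on the training folds;
# the held-out fold is scored on real, non-synthetic samples.
pipe = Pipeline([
    ('smote', SMOTE(sampling_strategy='not majority')),
    ('clf', LogisticRegression(max_iter=1000)),  # illustrative classifier
])

scores = cross_val_score(pipe, X, y,
    scoring=make_scorer(f1_score, average='macro'),
    cv=StratifiedKFold(n_splits=5))
print(round(scores.mean(), 2))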

Let me know if I missed something.
