I'm trying to build a multiclass classification model on imbalanced data with few samples (436) and 3 classes. After standardizing the data, I split it with StratifiedKFold to make sure the minority class is well represented in both the train and test splits:
sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)  # tried 5/10/15/20 splits
for train_index, test_index in sss.split(X, y):
    # print("Train:", train_index, "Test:", test_index)
    original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
    original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]
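As a sanity check, a small sketch (with hypothetical 300/100/36 class counts standing in for the real 436 samples; the names `skf`, `X`, `y` here are illustrative, not from the question) shows that StratifiedKFold keeps the class ratio in every fold:

```python
from collections import Counter
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical imbalanced labels: 3 classes, 300/100/36, total 436
rng = np.random.default_rng(0)
y = np.array([0] * 300 + [1] * 100 + [2] * 36)
X = rng.normal(size=(len(y), 4))

skf = StratifiedKFold(n_splits=5, shuffle=False)
for _, test_idx in skf.split(X, y):
    # each test fold keeps (approximately) the overall class ratio:
    # 60 of class 0, 20 of class 1, and 7 or 8 of class 2
    print(Counter(y[test_idx]))
```

So even with only 36 minority samples, every fold sees some of them.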
I've read that feature selection and oversampling should be applied only to the training set, and that's exactly what I did.
# I did feature selection before this
smote = SMOTE(sampling_strategy='not majority')
X_sm, y_sm = smote.fit_resample(original_Xtrain, original_ytrain)
print(X_sm.shape, y_sm.shape)
Then I trained my model on the training set oversampled by SMOTE. At this point I want to use cross-validation, but I don't know whether I should reuse the stratified splitter from before, pass a plain integer such as cv=5, or use ShuffleSplit instead.
for key, classifier in classifiers.items():
    classifier.fit(X_sm, y_sm)
    training_score1 = cross_val_score(classifier, X_sm, y_sm,
                                      scoring=make_scorer(f1_score, average='macro'),
                                      error_score="raise", cv=5)
    print("Classifier:", classifier.__class__.__name__,
          "has a training F1 (macro) score of", round(training_score1.mean(), 2) * 100, "%")
    training_score2 = cross_val_score(classifier, X_sm, y_sm,
                                      scoring=make_scorer(roc_auc_score, average='macro',
                                                          multi_class='ovo', needs_proba=True),
                                      error_score="raise", cv=5)
    print("Classifier:", classifier.__class__.__name__,
          "has a training ROC AUC (macro, OvO) score of", round(training_score2.mean(), 2) * 100, "%")
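For reference, cross_val_score accepts a splitter object directly as its cv argument, so the earlier StratifiedKFold can be reused instead of the integer 5 (with a classifier, an integer cv already uses stratified folds, but passing the splitter makes it explicit). A sketch on synthetic data, where make_classification stands in for the real X and y:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for the real data: 436 samples, 3 imbalanced classes
X, y = make_classification(n_samples=436, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# pass the splitter itself as cv instead of a plain integer
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring=make_scorer(f1_score, average='macro'), cv=cv)
print(round(float(scores.mean()), 3))
```

Either way the folds are stratified; the splitter form just lets you control shuffle and random_state.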
Well, I tried both, and using a new cv parameter gives me the best results. I think that with imbalanced data StratifiedKFold replaces train_test_split, so there is no need to use it again for cross-validation, but I'm not sure about this. Can you please tell me whether I missed something in my process? Am I doing it wrong?
This is your cross-validation strategy:
sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)#5/10/15/20
You don't need another one inside the loop.
using a new cv parameter gives me the best results.
It does, because you are using oversampled data as the test data inside this call:
training_score1 = cross_val_score(classifier, X_sm, y_sm,scoring=make_scorer(f1_score, average='macro'),error_score="raise", cv=5)
But you already know you shouldn't do that:
oversampling should be applied only to the training set
Let me know if I missed something.
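To keep oversampled rows out of the test folds while still cross-validating, the resampling has to happen inside each training fold. A minimal sketch of that pattern, using plain random oversampling in place of SMOTE to avoid the imblearn dependency (with imblearn installed, an imblearn.pipeline.Pipeline wrapping SMOTE and the classifier, passed to cross_val_score, achieves the same per-fold behavior); the data and the oversample helper here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in for the real data: 436 samples, 3 imbalanced classes
X, y = make_classification(n_samples=436, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)

def oversample(X_tr, y_tr, rng):
    # duplicate minority-class rows until every class matches the majority count
    counts = np.bincount(y_tr)
    target = counts.max()
    idx = np.arange(len(y_tr))
    extra = [rng.choice(idx[y_tr == c], target - n, replace=True)
             for c, n in enumerate(counts) if n < target]
    keep = np.concatenate([idx] + extra) if extra else idx
    return X_tr[keep], y_tr[keep]

rng = np.random.default_rng(0)
scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # resample ONLY the training fold; the test fold stays untouched
    X_res, y_res = oversample(X[tr], y[tr], rng)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    scores.append(f1_score(y[te], clf.predict(X[te]), average='macro'))
print(round(float(np.mean(scores)), 3))
```

Scored this way, each fold's test data keeps its natural imbalance, so the cross-validation estimate is no longer inflated by synthetic samples.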