
Classification ML Model Training with Unbalanced Dataset

I am trying to do classification with machine learning. I have "good" and "bad" classes in my dataset.

Dataset shape: (248857, 12)

Due to some constraints, I am not able to collect more "good" class results; there are around 40k "good" and 210k "bad" samples. Is this class imbalance a problem for these models?

I trained the model as follows (Naive Bayes is shown as an example, but I also use KNN, SVM, MLP, Random Forest, and Decision Tree):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# df is the (248857, 12) DataFrame; 'Label' holds the good/bad classes
X = df.drop(['Label'], axis=1)
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_predNaive = classifier.predict(X_test)

# accuracy_score expects (y_true, y_pred)
print(f'Test score {accuracy_score(y_test, y_predNaive)}')
plot_confusionmatrix(y_predNaive, y_test, dom='Test')  # custom plotting helper
print('Classification Report for Naive Bayes\n\n', classification_report(y_test, y_predNaive))

There are multiple ways to deal with this. First, you can switch the evaluation metric from accuracy to something like the F1-score (or per-class precision and recall), since plain accuracy is misleading on imbalanced data. Alternatively, you can randomly undersample the majority class, e.g. remove roughly 170k "bad" samples so the two classes are the same size. Finally, random forests tend to handle imbalanced datasets reasonably well, so you may be able to skip the resampling step entirely by sticking with a random forest.
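As a rough sketch of these suggestions (not the asker's actual pipeline: synthetic data stands in for `df`, and the class ratio is only illustrative), you can combine a class-weighted random forest with per-class metrics, and optionally undersample the majority class:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score

# Synthetic imbalanced data standing in for the question's DataFrame
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 12))
y = np.array(['bad'] * 4200 + ['good'] * 800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# class_weight='balanced' re-weights samples inversely to class frequency,
# so the minority "good" class is not drowned out during training
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision/recall/F1 instead of a single accuracy number
print(classification_report(y_test, y_pred))
print('Macro F1:', f1_score(y_test, y_pred, average='macro'))

# Alternative: random undersampling of the majority class
bad_idx = np.where(y_train == 'bad')[0]
good_idx = np.where(y_train == 'good')[0]
keep = rng.choice(bad_idx, size=len(good_idx), replace=False)
balanced_idx = np.concatenate([keep, good_idx])
X_bal, y_bal = X_train[balanced_idx], y_train[balanced_idx]
```

Undersampling throws away data, so the metric change and `class_weight` route are usually worth trying first; with 40k minority samples, there is still plenty of signal for the minority class.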
