
Classification ML Model Training with Unbalanced Dataset

I am trying to do classification with machine learning. I have "good" and "bad" classes in my dataset.

Dataset shape: (248857, 12)

Due to some constraints, I am not able to collect more "good" class results; there are around 40k "good" and 210k "bad" samples. Is this class imbalance a problem for these models?

I trained the model as follows (Naive Bayes is shown as an example, but I also use KNN, SVM, MLP, Random Forest, and Decision Tree):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# df is the (248857, 12) DataFrame; 'Label' holds the good/bad classes
X = df.drop(['Label'], axis=1)
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_predNaive = classifier.predict(X_test)

# accuracy_score expects (y_true, y_pred)
print(f'Test score {accuracy_score(y_test, y_predNaive)}')
plot_confusionmatrix(y_predNaive, y_test, dom='Test')  # custom plotting helper
print('Classification Report for Naive Bayes\n\n', classification_report(y_test, y_predNaive))

There are multiple ways to deal with this. First, you can switch the evaluation metric from accuracy to something like the F1-score (or per-class precision and recall), since plain accuracy is misleading on imbalanced data. Alternatively, you can randomly undersample the majority class, e.g. remove roughly 170k "bad" samples so the two classes are the same size. Finally, random forests tend to handle imbalanced datasets reasonably well, so you may be able to skip the resampling step entirely by sticking with a random forest.
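As a rough sketch of these suggestions (not the asker's actual pipeline: synthetic data stands in for `df`, and the class ratio is only illustrative), you can combine a class-weighted random forest with per-class metrics, and optionally undersample the majority class:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score

# Synthetic imbalanced data standing in for the question's DataFrame
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 12))
y = np.array(['bad'] * 4200 + ['good'] * 800)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# class_weight='balanced' re-weights samples inversely to class frequency,
# so the minority "good" class is not drowned out during training
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision/recall/F1 instead of a single accuracy number
print(classification_report(y_test, y_pred))
print('Macro F1:', f1_score(y_test, y_pred, average='macro'))

# Alternative: random undersampling of the majority class
bad_idx = np.where(y_train == 'bad')[0]
good_idx = np.where(y_train == 'good')[0]
keep = rng.choice(bad_idx, size=len(good_idx), replace=False)
balanced_idx = np.concatenate([keep, good_idx])
X_bal, y_bal = X_train[balanced_idx], y_train[balanced_idx]
```

Undersampling throws away data, so the metric change and `class_weight` route are usually worth trying first; with 40k minority samples, there is still plenty of signal for the minority class.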
