
Undersampling for Imbalanced Class in Python

I currently have an imbalanced dataset of over 800,000 datapoints. The imbalance is severe: there are only 3,719 datapoints for one of the two classes. After undersampling the data with the NearMiss algorithm in Python and applying a Random Forest classifier, I achieve the following results:

  • Accuracy: 81.4%
  • Precision: 82.6%
  • Recall: 79.4%
  • Specificity: 83.4%

However, when I re-test this same model on the full dataset, the confusion matrix shows a large bias towards the minority class, with a large number of false positives. Is this the correct way of testing the model after undersampling?

Undersampling straight from 800k records down to about 4k throws away a great deal of the information in your data. Most of the time you over-sample first and under-sample second, and there's a dedicated package for exactly that: imblearn. As for validation: you don't want to score resampled records, as that will distort your results; evaluate on data that keeps the original class distribution. Also look closer at the scoring/averaging parameters in sklearn, namely micro, macro, and weighted, and at the metrics imblearn provides specifically for imbalanced classification.
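A minimal sketch of that workflow using imblearn's Pipeline: SMOTE over-samples the minority class, RandomUnderSampler then trims the majority class, and because imblearn pipelines apply samplers only during fit(), the held-out test set keeps the original class distribution. The sampling ratios, classifier settings, and synthetic dataset here are illustrative assumptions, not tuned values.

```python
# Sketch: over-sample first, under-sample second, evaluate on untouched data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Synthetic stand-in for a severely imbalanced dataset (~0.5% positives).
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.995], random_state=42)

# Split BEFORE any resampling so the test set keeps the true class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)

# imblearn's Pipeline applies the samplers only during fit(); predict()
# passes data through unchanged, so the test set is never resampled.
model = Pipeline([
    # Grow the minority class to 10% of the majority (assumed ratio).
    ("over", SMOTE(sampling_strategy=0.1, random_state=42)),
    # Then shrink the majority class to 2x the minority (assumed ratio).
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
model.fit(X_train, y_train)

# Per-class precision/recall (and macro/weighted averages) are far more
# informative than plain accuracy on the imbalanced test set.
print(classification_report(y_test, model.predict(X_test), digits=3))
```

For the dedicated metrics, imblearn.metrics offers classification_report_imbalanced, which adds specificity and the geometric mean alongside the usual per-class precision and recall.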
