
Undersampling for Imbalanced Class in Python

I currently have an imbalanced dataset of over 800,000 datapoints. The imbalance is severe: there are only 3,719 datapoints for one of the two classes. After undersampling the data with the NearMiss algorithm in Python and applying a Random Forest classifier, I achieve the following results:

  • Accuracy: 81.4%
  • Precision: 82.6%
  • Recall: 79.4%
  • Specificity: 83.4%

However, when I re-test this same model on the full dataset, the confusion matrix shows a large bias towards the minority class, with a large number of false positives. Is this the correct way of testing the model after undersampling?

Undersampling straight from 800k records down to about 4k throws away a great deal of the information in your data. Most of the time you over-sample first and under-sample second, and there's a dedicated package for exactly that: imblearn. As for validation: you don't want to score resampled records, as that will distort your results; evaluate on data that keeps the original class distribution. Also look closer at the scoring/averaging parameters in sklearn, namely micro, macro, and weighted, and at the metrics imblearn provides specifically for imbalanced classification.
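A minimal sketch of that workflow using imblearn's Pipeline: SMOTE over-samples the minority class, RandomUnderSampler then trims the majority class, and because imblearn pipelines apply samplers only during fit(), the held-out test set keeps the original class distribution. The sampling ratios, classifier settings, and synthetic dataset here are illustrative assumptions, not tuned values.

```python
# Sketch: over-sample first, under-sample second, evaluate on untouched data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Synthetic stand-in for a severely imbalanced dataset (~0.5% positives).
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.995], random_state=42)

# Split BEFORE any resampling so the test set keeps the true class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)

# imblearn's Pipeline applies the samplers only during fit(); predict()
# passes data through unchanged, so the test set is never resampled.
model = Pipeline([
    # Grow the minority class to 10% of the majority (assumed ratio).
    ("over", SMOTE(sampling_strategy=0.1, random_state=42)),
    # Then shrink the majority class to 2x the minority (assumed ratio).
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
model.fit(X_train, y_train)

# Per-class precision/recall (and macro/weighted averages) are far more
# informative than plain accuracy on the imbalanced test set.
print(classification_report(y_test, model.predict(X_test), digits=3))
```

For the dedicated metrics, imblearn.metrics offers classification_report_imbalanced, which adds specificity and the geometric mean alongside the usual per-class precision and recall.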
