I am working on a prediction model with unbalanced data, i.e., my target variable has a distribution of 10% = 1 and 90% = 0.
In order to improve prediction performance, balancing (either oversampling or undersampling) is typically suggested.
I am wondering whether I need to balance the entire dataset or only the training set. If I rebalance the entire dataset with oversampling, I am duplicating observations, which means that observations from the training set will reappear in the test set and thereby artificially inflate prediction performance, right?
For undersampling this should not matter, I think.
Any thoughts?
You should balance only your training set; there is no need to balance the test or validation sets. If your model is well trained, it will handle unbalanced data properly at test/validation time; if it doesn't, it is not well trained. Moreover, you want to assess real-world performance, and for that you need to evaluate on data with the real-world class distribution.
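A minimal sketch of this split-first workflow, using only numpy with hypothetical synthetic data (the 80/20 split and the exact resampling scheme are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: roughly 10% positives, 90% negatives (hypothetical).
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.10).astype(int)

# 1) Split FIRST, so no duplicated observation can leak into the test set.
idx = rng.permutation(len(y))
train_idx, test_idx = idx[:800], idx[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Oversample the minority class within the training set only,
#    drawing extra minority rows with replacement until classes are equal.
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
bal_idx = np.concatenate([neg, pos, extra])
X_bal, y_bal = X_train[bal_idx], y_train[bal_idx]

# The training set is now 50/50; the test set keeps its natural class ratio,
# so evaluation still reflects real-world performance.
```

Because the split happens before any resampling, every duplicated row lives entirely inside the training set, which is exactly what avoids the leakage described in the question.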
If you decide to oversample, consider adding a small amount of random noise to each duplicated observation, so the copies are not exact duplicates.
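A small sketch of that jittering idea, with an assumed noise scale of 5% of each feature's standard deviation (the data and the scale are illustrative, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical minority-class training features (20 rows, 3 features).
X_min = rng.normal(size=(20, 3))

# Duplicate minority rows with replacement, then jitter each copy with
# small Gaussian noise scaled per feature so no copy is an exact duplicate.
dup = X_min[rng.integers(0, len(X_min), size=60)]
noise = rng.normal(scale=0.05 * X_min.std(axis=0), size=dup.shape)
X_aug = np.vstack([X_min, dup + noise])  # 80 rows total
```

Scaling the noise per feature keeps the perturbation proportional to each feature's natural spread, which matters when features are on different scales.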