
Oversampling and undersampling with unbalanced data

I am working on a prediction model with unbalanced data, i.e. my target variable has a distribution of 10% = 1 and 90% = 0.

In order to improve prediction performance, balancing (either oversampling or undersampling) is typically suggested.

I am wondering whether I need to balance the entire dataset or only the training set. If I rebalance the entire dataset using oversampling, I am duplicating observations, which means that copies of training-set observations can reappear in the testing set and thereby artificially improve prediction performance, right?
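To make this leakage concern concrete, here is a toy sketch (made-up data, naive duplication-based oversampling) showing that oversampling before the split puts copies of the same row on both sides of the train/test boundary:

```python
# Toy sketch: oversampling BEFORE the train/test split leaks duplicated
# rows into the test set. Data, sizes, and split ratio are all made up.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 0.1).astype(int)        # ~10% positives, ~90% negatives

# Naive oversampling of the minority class on the FULL dataset:
# draw extra copies of positive rows until the classes are balanced
pos = np.flatnonzero(y == 1)
extra = rng.choice(pos, size=n - 2 * len(pos), replace=True)
idx = np.concatenate([np.arange(n), extra])  # original rows + duplicate rows

# Random 80/20 split AFTER oversampling
rng.shuffle(idx)
cut = int(0.8 * len(idx))
train, test = idx[:cut], idx[cut:]

# Because minority rows now exist in many copies, the same original row
# almost surely lands in both train and test: overlap is non-empty
overlap = np.intersect1d(train, test)
```

Any classifier that memorizes those shared rows will look better on this test set than it would on genuinely unseen data.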

For undersampling this should not matter, I think.
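A small sketch of that point (toy data again): undersampling only removes rows and never duplicates any, so a later random split cannot share rows between train and test, though the test set would then no longer reflect the real-world class ratio.

```python
# Toy sketch: undersampling keeps each retained row exactly once, so a
# subsequent train/test split has no overlap. Data and sizes are made up.
import numpy as np

rng = np.random.default_rng(2)
y = (rng.random(1000) < 0.1).astype(int)     # ~10% positives

# Keep all positives; draw an equal number of negatives WITHOUT replacement
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])             # balanced subset, all rows unique

# Random 80/20 split of the undersampled data
rng.shuffle(idx)
cut = int(0.8 * len(idx))
train, test = idx[:cut], idx[cut:]
# Since every index in idx is unique, train and test cannot overlap
```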

Any thoughts?

You should balance your training data set, but there is no need to balance the test or validation sets. If your system is well trained, it will handle unbalanced data properly at test/validation time; if it does not, it is not well trained. Also, you want to assess real-world performance, and for that you need to test on real-world (i.e., unbalanced) data.
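The workflow described above can be sketched as follows (a NumPy-only toy example with made-up data; in practice the split would usually be stratified with a library such as scikit-learn): split first, then oversample only the training portion, leaving the test set at its natural 10/90 ratio.

```python
# Sketch: split FIRST, then balance only the training set by oversampling.
# The test set is left untouched so it reflects the real-world class ratio.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 0.1).astype(int)        # ~10% positives

# Plain random 80/20 split on the ORIGINAL, unbalanced data
perm = rng.permutation(n)
cut = int(0.8 * n)
X_tr, y_tr = X[perm[:cut]], y[perm[:cut]]
X_te, y_te = X[perm[cut:]], y[perm[cut:]]

# Oversample the minority class within the training set only
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
boost = rng.choice(pos, size=len(neg) - len(pos), replace=True)
keep = np.concatenate([np.arange(len(y_tr)), boost])

X_bal, y_bal = X_tr[keep], y_tr[keep]
# X_bal/y_bal are now class-balanced; X_te/y_te keep the natural imbalance
```

Because the duplicates are created after the split, none of them can cross into the test set.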

If you decide to oversample, make sure to add a little random noise to the duplicated observations to reduce the impact of exact duplicates.
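A minimal sketch of that noise trick, assuming continuous features (the noise scale of 0.05 is an arbitrary choice you would tune to your feature scales; for categorical features this approach does not apply directly):

```python
# Sketch: jitter resampled minority rows with small Gaussian noise so
# the added rows are not exact copies. Toy data; noise scale is made up.
import numpy as np

rng = np.random.default_rng(1)
X_min = rng.normal(size=(50, 3))                    # minority-class rows

# Resample minority rows with replacement, then perturb each copy
dup = X_min[rng.integers(0, len(X_min), size=200)]
noise = rng.normal(scale=0.05, size=dup.shape)      # small perturbation
X_aug = dup + noise
# X_aug has 200 jittered rows; none is an exact repeat of its source row
```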

