I am working on a prediction model with unbalanced data, i.e., my target variable has a distribution of 10% = 1 and 90% = 0.
In order to improve prediction performance, balancing (either oversampling or undersampling) is typically suggested.
I am wondering whether I need to balance the entire dataset or only the training set. If I rebalance the entire dataset with oversampling, I am duplicating observations, which means that observations from the training set will reappear in the test set and thereby artificially inflate prediction performance, right?
For undersampling this should not matter, I think.
Any thoughts?
You should balance only your training set; there is no need to balance the test or validation sets. If your model is well trained, it will handle unbalanced data properly at test/validation time; if it doesn't, it is not well trained. Moreover, you want to assess real-world performance, and for that you need to evaluate on data with the real-world class distribution.
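A minimal sketch of this split-first workflow, using only numpy with hypothetical synthetic data (the 80/20 split and the exact resampling scheme are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: roughly 10% positives, 90% negatives (hypothetical).
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.10).astype(int)

# 1) Split FIRST, so no duplicated observation can leak into the test set.
idx = rng.permutation(len(y))
train_idx, test_idx = idx[:800], idx[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Oversample the minority class within the training set only,
#    drawing extra minority rows with replacement until classes are equal.
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
bal_idx = np.concatenate([neg, pos, extra])
X_bal, y_bal = X_train[bal_idx], y_train[bal_idx]

# The training set is now 50/50; the test set keeps its natural class ratio,
# so evaluation still reflects real-world performance.
```

Because the split happens before any resampling, every duplicated row lives entirely inside the training set, which is exactly what avoids the leakage described in the question.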
If you decide to oversample, consider adding a small amount of random noise to each duplicated observation, so the copies are not exact duplicates.
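A small sketch of that jittering idea, with an assumed noise scale of 5% of each feature's standard deviation (the data and the scale are illustrative, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical minority-class training features (20 rows, 3 features).
X_min = rng.normal(size=(20, 3))

# Duplicate minority rows with replacement, then jitter each copy with
# small Gaussian noise scaled per feature so no copy is an exact duplicate.
dup = X_min[rng.integers(0, len(X_min), size=60)]
noise = rng.normal(scale=0.05 * X_min.std(axis=0), size=dup.shape)
X_aug = np.vstack([X_min, dup + noise])  # 80 rows total
```

Scaling the noise per feature keeps the perturbation proportional to each feature's natural spread, which matters when features are on different scales.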