
How to use a combination of over- and undersampling with imbalanced-learn?

I want to resample some big data (class sizes: 8 million vs. 2,700). I would like to have 50,000 samples of each class, by oversampling class 2 and undersampling class 1. imblearn seems to offer a combination of over- and undersampling, but I don't get how it works.

from collections import Counter
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=1)
X_resamp, y_resamp = smt.fit_resample(data_all[29000:30000], labels_all[29000:30000])

Before the data looked like

>>> Counter(labels_all[29000:30000])
Counter({0: 968, 9: 32})

and afterwards

>>> Counter(y_resamp)
Counter({0: 968, 9: 968})

whereas I would expect, or at least wish for, something like

>>> Counter(y_resamp)
Counter({0: 100, 9: 100})

It seems you only have 32 records of class 9, so by default SMOTETomek oversamples that class until its record count matches that of class 0, hence 9: 968.

If you want to reduce the data set to 100 records per class, you can sample 100 records randomly for each class from X and y (the same 100 rows in both), or simply take the first 100, like y_resamp[:100].

