
scikit-learn undersampling of unbalanced data for cross-validation

How do I generate random folds for cross-validation in scikit-learn?

Imagine we have 20 samples of one class and 80 of the other, and we need to generate N train/test sets, each training set of size 30, under the constraint that each training set contains 50% of class 1 and 50% of class 2.

I found this discussion (https://github.com/scikit-learn/scikit-learn/issues/1362), but I don't understand how to get the folds. Ideally, I think I need a function like this:

cfolds = cross_validation.imaginaryfunction(
    [list(itertools.repeat(1, 20)), list(itertools.repeat(2, 80))],
    n_iter=100, test_size=0.70)

What am I missing?

There is no direct way to do cross-validation with undersampling in scikit-learn, but there are two workarounds:

1.

Use StratifiedKFold to get cross-validation folds whose class distribution mirrors the distribution of the whole dataset, then counter the imbalance inside the classifier via the class_weight parameter, which can either be set to 'balanced' (formerly 'auto'), weighting classes inversely proportional to their frequency, or be given a dictionary of explicit weights; see the sketch after this list.

2.

Write your own cross-validation routine, which should be pretty straightforward using pandas.
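For the first workaround, here is a minimal sketch, assuming a recent scikit-learn where StratifiedKFold lives in sklearn.model_selection and class_weight='balanced' has replaced the older 'auto'; the LogisticRegression classifier and the toy 20/80 data are just placeholders.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Toy data: 20 samples of class 1, 80 samples of class 2.
X = np.random.randn(100, 5)
y = np.array([1] * 20 + [2] * 80)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each fold keeps the 20/80 class ratio; the imbalance is handled by
    # weighting classes inversely proportional to their frequency.
    clf = LogisticRegression(class_weight="balanced")
    clf.fit(X[train_idx], y[train_idx])
    print(clf.score(X[test_idx], y[test_idx]))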

StratifiedCV is a good choice, but you can make it simpler:

  1. Randomly sample the class-1 data (you need to select 15 of the 20 samples).
  2. Do the same for class 2 (15 of the 80 samples).
  3. Repeat 100 times, or as many times as you need.

That's all. Quick and workable!
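A rough sketch of that loop, assuming plain NumPy arrays for the data; the helper name undersampled_splits and its parameters are made up for illustration:

import numpy as np

def undersampled_splits(y, n_iter=100, n_per_class=15, random_state=0):
    # Yield (train_idx, test_idx) pairs with a 50/50 balanced training set.
    rng = np.random.RandomState(random_state)
    idx_c1 = np.where(y == 1)[0]
    idx_c2 = np.where(y == 2)[0]
    for _ in range(n_iter):
        # Step 1: pick 15 of the 20 class-1 samples at random, without replacement.
        train_c1 = rng.choice(idx_c1, size=n_per_class, replace=False)
        # Step 2: pick 15 of the 80 class-2 samples at random, without replacement.
        train_c2 = rng.choice(idx_c2, size=n_per_class, replace=False)
        train_idx = np.concatenate([train_c1, train_c2])
        # Everything not drawn into the training set becomes the test set.
        test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
        yield train_idx, test_idx

# Step 3: repeat as many times as needed (100 here).
y = np.array([1] * 20 + [2] * 80)
for train_idx, test_idx in undersampled_splits(y, n_iter=100):
    pass  # fit and evaluate a classifier on X[train_idx] / X[test_idx]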
