scikit-learn undersampling of unbalanced data for cross-validation
How do I generate random folds for cross-validation in scikit-learn?
Imagine we have 20 samples of one class and 80 of the other, and we need to generate N train and test sets, each training set of size 30, under the constraint that each training set contains 50% of class 1 and 50% of class 2.
I found this discussion ( https://github.com/scikit-learn/scikit-learn/issues/1362 ) but I don't understand how to get the folds. Ideally I think I need a function like:
cfolds = np.cross_validation.imaginaryfunction(
    [list(itertools.repeat(1, 20)), list(itertools.repeat(2, 80))],
    n_iter=100, test_size=0.70)
What am I missing?
There is no direct way to do cross-validation with undersampling in scikit-learn, but there are two workarounds:
1. Use stratified cross-validation (StratifiedKFold) so that the class distribution in each fold mirrors the distribution of the data, then counter the imbalance inside the classifier via the class_weight param, which can either take auto (weighting classes inversely proportional to their counts, effectively undersampling/oversampling) or a dictionary with explicit weights.
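A minimal sketch of option 1, assuming a recent scikit-learn where StratifiedKFold lives in sklearn.model_selection and where the old class_weight='auto' is spelled 'balanced':

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Toy data matching the question: 20 samples of class 1, 80 of class 2.
X = np.random.RandomState(0).randn(100, 5)
y = np.array([1] * 20 + [2] * 80)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the 20/80 class ratio; class_weight='balanced'
    # reweights classes inversely proportional to their frequency.
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
```

Note this does not physically undersample; it keeps every sample and compensates through the loss weighting instead.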
2. Write your own cross-validation routine, which should be pretty straightforward using pandas.
StratifiedCV is a good choice but you can make it simpler:
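A sketch of such a hand-rolled routine (the function name balanced_undersample_folds is hypothetical), which yields N random 50/50 training sets of a fixed size and uses the remaining samples as the test set:

```python
import numpy as np

def balanced_undersample_folds(y, n_iter=100, train_size=30, random_state=0):
    """Yield (train_idx, test_idx) pairs where the train set holds
    train_size samples split evenly between the classes in y."""
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes = np.unique(y)
    per_class = train_size // len(classes)
    for _ in range(n_iter):
        # Undersample: draw per_class indices from each class without replacement.
        train = np.concatenate([
            rng.choice(np.where(y == c)[0], per_class, replace=False)
            for c in classes
        ])
        # Everything not drawn into the train set becomes the test set.
        test = np.setdiff1d(np.arange(len(y)), train)
        yield train, test

y = np.array([1] * 20 + [2] * 80)
train_idx, test_idx = next(balanced_undersample_folds(y))
# train_idx has 30 samples, 15 per class; test_idx holds the remaining 70.
```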
That's all. Quick and workable!