scikit-learn undersampling of unbalanced data for cross-validation
How do I generate random folds for cross-validation in scikit-learn?
Imagine we have 20 samples of one class and 80 of the other, and we need to generate N train and test sets, each training set of size 30, under the constraint that each training set contains 50% of class 1 and 50% of class 2.
I found this discussion ( https://github.com/scikit-learn/scikit-learn/issues/1362 ) but I don't understand how to get the folds. Ideally I think I need a function like:
cfolds = np.cross_validation.imaginaryfunction(
    [list(itertools.repeat(1, 20)), list(itertools.repeat(2, 80))],
    n_iter=100, test_size=0.70)
What am I missing?
There is no direct way to do cross-validation with undersampling in scikit-learn, but there are two workarounds:
1. Use stratified cross-validation (StratifiedKFold) so that the class distribution in each fold mirrors the distribution of the data, then counter the imbalance inside the classifier via the class_weight param, which can either take auto (weighting classes inversely proportional to their counts, effectively undersampling/oversampling) or a dictionary with explicit weights.
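A minimal sketch of option 1, assuming a recent scikit-learn where StratifiedKFold lives in sklearn.model_selection and where the old class_weight='auto' is spelled 'balanced':

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Toy data matching the question: 20 samples of class 1, 80 of class 2.
X = np.random.RandomState(0).randn(100, 5)
y = np.array([1] * 20 + [2] * 80)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the 20/80 class ratio; class_weight='balanced'
    # reweights classes inversely proportional to their frequency.
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
```

Note this does not physically undersample; it keeps every sample and compensates through the loss weighting instead.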
2. Write your own cross-validation routine, which should be pretty straightforward using pandas.
StratifiedCV is a good choice but you can make it simpler:
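A sketch of such a hand-rolled routine (the function name balanced_undersample_folds is hypothetical), which yields N random 50/50 training sets of a fixed size and uses the remaining samples as the test set:

```python
import numpy as np

def balanced_undersample_folds(y, n_iter=100, train_size=30, random_state=0):
    """Yield (train_idx, test_idx) pairs where the train set holds
    train_size samples split evenly between the classes in y."""
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes = np.unique(y)
    per_class = train_size // len(classes)
    for _ in range(n_iter):
        # Undersample: draw per_class indices from each class without replacement.
        train = np.concatenate([
            rng.choice(np.where(y == c)[0], per_class, replace=False)
            for c in classes
        ])
        # Everything not drawn into the train set becomes the test set.
        test = np.setdiff1d(np.arange(len(y)), train)
        yield train, test

y = np.array([1] * 20 + [2] * 80)
train_idx, test_idx = next(balanced_undersample_folds(y))
# train_idx has 30 samples, 15 per class; test_idx holds the remaining 70.
```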
That's all. Quick and workable!