Stratified Labeled K-Fold Cross-Validation in Scikit-Learn

Question

I'm trying to classify instances of a dataset as being in one of two classes, a or b. B is a minority class and makes up only 8% of the dataset. Every instance is assigned an id indicating which subject generated the data. Because every subject generated multiple instances, ids are repeated frequently in the dataset.

The table below is just an example; the real table has about 100,000 instances. Each subject id has about 100 instances in the table. Every subject is tied to exactly one class, as you can see with "larry" below.

    * field  * field  *   id   *  class  
*******************************************
 0  *   _    *   _    *  bob   *    a
 1  *   _    *   _    *  susan *    a
 2  *   _    *   _    *  susan *    a
 3  *   _    *   _    *  bob   *    a
 4  *   _    *   _    *  larry *    b
 5  *   _    *   _    *  greg  *    a
 6  *   _    *   _    *  larry *    b
 7  *   _    *   _    *  bob   *    a
 8  *   _    *   _    *  susan *    a
 9  *   _    *   _    *  susan *    a
 10 *   _    *   _    *  bob   *    a
 11 *   _    *   _    *  greg  *    a
 ...   ...      ...      ...       ...

I would like to use cross-validation to tune the model and must stratify the dataset so that each fold contains a few examples of the minority class, b. The problem is that I have a second constraint: the same id must never appear in two different folds, as this would leak information about the subject.

I'm using Python's scikit-learn library. I need a method that combines LabelKFold, which makes sure labels (ids) are not split among folds, and StratifiedKFold, which makes sure every fold has a similar ratio of classes. How can I accomplish this using scikit-learn? If it is not possible to split on two constraints in sklearn, how can I effectively split the dataset by hand or with other Python libraries?

Answer 1

The following is a bit tricky with respect to indexing (it would help to use something like Pandas for it), but conceptually simple.

Suppose you make a dummy dataset whose only columns are id and class. Furthermore, in this dataset, remove duplicate id entries, so that each subject appears exactly once.

For your cross-validation, run stratified cross-validation on the dummy dataset. At each iteration:
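As a minimal sketch of that dummy dataset, assuming the data lives in a pandas DataFrame with columns named "id" and "class" (the column names and the toy rows are taken from the example table in the question; the variable names are my own):

```python
import pandas as pd

# Toy version of the table from the question; the real data has ~100,000 rows.
df = pd.DataFrame({
    "id":    ["bob", "susan", "susan", "bob", "larry", "greg",
              "larry", "bob", "susan", "susan", "bob", "greg"],
    "class": ["a", "a", "a", "a", "b", "a", "b", "a", "a", "a", "a", "a"],
})

# Sanity check: the approach is only valid if every id maps to exactly one class.
assert (df.groupby("id")["class"].nunique() == 1).all()

# Dummy dataset: keep only id and class, then drop repeated ids
# so that each subject appears exactly once.
subjects = df[["id", "class"]].drop_duplicates(subset="id").reset_index(drop=True)
print(subjects)
```

The resulting frame has one row per subject, which is what the stratified split then operates on.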

- Find out which ids were selected for the train set and which for the test set.
- Go back to the original dataset, and put all the instances belonging to each selected id into the corresponding train or test set.
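The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the synthetic data (50 subjects, 10 rows each, 4 minority subjects), the column names, and the choice of 4 folds are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for the real table: 50 subjects, 10 rows each,
# 4 of the subjects (8%) belong to the minority class "b".
n_subjects, rows_per_subject = 50, 10
subject_ids = np.array([f"s{i}" for i in range(n_subjects)])
subject_class = np.where(np.arange(n_subjects) < 4, "b", "a")

df = pd.DataFrame({
    "id": np.repeat(subject_ids, rows_per_subject),
    "class": np.repeat(subject_class, rows_per_subject),
    "field": rng.normal(size=n_subjects * rows_per_subject),
})

# Dummy dataset: one row per subject.
subjects = df[["id", "class"]].drop_duplicates(subset="id").reset_index(drop=True)

# Stratify over subjects, not over rows.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

folds = []
for train_subj, test_subj in skf.split(subjects, subjects["class"]):
    # Which ids were selected for train and for test in this iteration.
    train_ids = subjects["id"].iloc[train_subj]
    test_ids = subjects["id"].iloc[test_subj]
    # Map the selected ids back to row positions in the original table.
    train_idx = np.flatnonzero(df["id"].isin(train_ids).to_numpy())
    test_idx = np.flatnonzero(df["id"].isin(test_ids).to_numpy())
    folds.append((train_idx, test_idx))
```

Since scikit-learn's `cv` parameter accepts an iterable of (train, test) index arrays, a list like `folds` can be passed directly to utilities such as `GridSearchCV` or `cross_val_score`. Note that stratifying over subjects only balances rows well when subjects contribute a similar number of rows each, as in the question.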

This works because:

- As you stated, each id is associated with a single class.
- Since we run stratified CV over the ids, each class is represented proportionally in every fold.
- Since each id appears only in the train set or the test set (but not both), the label (id) constraint is satisfied too.

Answer 1
4 · accepted · 2016-09-03 15:09:03
