Stratified Labeled K-Fold Cross-Validation In Scikit-Learn

I'm trying to classify instances of a dataset as belonging to one of two classes, a or b. Class b is a minority class and makes up only 8% of the dataset. Every instance is assigned an id indicating which subject generated the data. Because every subject generated multiple instances, ids are repeated frequently in the dataset.

The table below is just an example; the real table has about 100,000 instances. Each subject id has about 100 instances in the table. Every subject is tied to exactly one class, as you can see with "larry" below.

    * field  * field  *   id   *  class  
*******************************************
 0  *   _    *   _    *  bob   *    a
 1  *   _    *   _    *  susan *    a
 2  *   _    *   _    *  susan *    a
 3  *   _    *   _    *  bob   *    a
 4  *   _    *   _    *  larry *    b
 5  *   _    *   _    *  greg  *    a
 6  *   _    *   _    *  larry *    b
 7  *   _    *   _    *  bob   *    a
 8  *   _    *   _    *  susan *    a
 9  *   _    *   _    *  susan *    a
 10 *   _    *   _    *  bob   *    a
 11 *   _    *   _    *  greg  *    a
 ...   ...      ...      ...       ...

I would like to use cross-validation to tune the model, and I must stratify the dataset so that each fold contains a few examples of the minority class, b. The problem is a second constraint: the same id must never appear in two different folds, as this would leak information about the subject.

I'm using Python's scikit-learn library. I need a method that combines LabelKFold, which makes sure labels (ids) are not split among folds, and StratifiedKFold, which makes sure every fold has a similar ratio of classes. How can I accomplish this using scikit-learn? If it is not possible to split on two constraints in sklearn, how can I effectively split the dataset by hand or with other Python libraries?

The following is a bit tricky with respect to indexing (it helps to use something like Pandas for it), but conceptually simple.

Suppose you make a dummy dataset that contains only the id and class columns. Then remove duplicate id entries from it, so that each subject appears exactly once.

For your cross-validation, run stratified cross-validation on the dummy dataset. At each iteration:

  1. Find out which ids were selected for the train set and which for the test set.

  2. Go back to the original dataset, and put all the instances belonging to each id into the train set or the test set accordingly.
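
The two steps above can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the helper name `stratified_group_folds`, the column names `id` and `class`, and the toy data are all assumptions made for the example, and it relies on every id mapping to exactly one class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def stratified_group_folds(df, id_col="id", class_col="class", n_splits=3, seed=0):
    """Yield (train_idx, test_idx) positional row indices (hypothetical helper)."""
    # Dummy dataset: one row per subject, keeping only id and class.
    subjects = df[[id_col, class_col]].drop_duplicates(subset=id_col)
    ids = subjects[id_col].to_numpy()
    classes = subjects[class_col].to_numpy()

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # The feature matrix is not used for stratification, so a placeholder is enough.
    for train_s, test_s in skf.split(np.zeros(len(ids)), classes):
        # Step 1: ids chosen for train and test in this iteration.
        train_ids, test_ids = ids[train_s], ids[test_s]
        # Step 2: map the ids back to every matching row of the original data.
        in_train = df[id_col].isin(train_ids).to_numpy()
        in_test = df[id_col].isin(test_ids).to_numpy()
        yield np.flatnonzero(in_train), np.flatnonzero(in_test)


# Toy data (made up): 20 subjects of class "a", 6 of class "b", 5 rows each.
subject_class = {f"s{i}": ("b" if i < 6 else "a") for i in range(26)}
rows = [(sid, cls) for sid, cls in subject_class.items() for _ in range(5)]
df = pd.DataFrame(rows, columns=["id", "class"])
df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle rows

for fold, (train_idx, test_idx) in enumerate(stratified_group_folds(df)):
    test = df.iloc[test_idx]
    print(fold, test["id"].nunique(), test["class"].value_counts().to_dict())
```

The yielded index pairs are positional, so they can be used with `df.iloc` or collected into a list and passed as the `cv` argument of scikit-learn's model-selection utilities.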

This works because:

  1. As you stated, each id is associated with a single class.

  2. Since we run stratified CV over subjects, each class is represented proportionally among subjects, and approximately so among instances as long as subjects contribute a similar number of instances.

  3. Since each id appears only in the train set or the test set (but not both), the label (id) constraint is satisfied too.
