Train/test split based on a group variable in sklearn

Question

My X is as follows:

Unique ID    Exp start date    Value    Status
001          01/01/2020        4000     Closed
001          12/01/2019        4000     Archived
002          01/01/2020        5000     Closed
002          12/01/2019        5000     Archived

I want to make sure that none of the unique IDs that appear in the training set are also included in the test set. I am using sklearn's train_test_split. Is this possible?

Answer 1

I believe you need GroupShuffleSplit (see the scikit-learn documentation).

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Dummy features and targets; only the group labels matter here
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])
print(groups.shape)

# Each split keeps all rows of a group on the same side
gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)

for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)

TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]

As the output above shows, the train/test indices are created based on the groups variable: all rows of a group land on the same side of each split.

In your case, the Unique ID column should be used as groups.
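Applied to a table like the one in the question, this could look roughly as follows. This is a minimal sketch, not a definitive implementation: the DataFrame df, its column names, and the choice of a single 50/50 split over the groups are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame mirroring the table in the question (column names are assumptions).
df = pd.DataFrame({
    "Unique ID": ["001", "001", "002", "002"],
    "Exp start date": ["01/01/2020", "12/01/2019", "01/01/2020", "12/01/2019"],
    "Value": [4000, 4000, 5000, 5000],
    "Status": ["Closed", "Archived", "Closed", "Archived"],
})

# One split; test_size=0.5 is a fraction of the groups, not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["Unique ID"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# IDs present on both sides of the split; this should be empty.
shared_ids = set(train_df["Unique ID"]) & set(test_df["Unique ID"])
print("shared IDs:", shared_ids)
```

Because the split is made over groups, the row counts of the two parts depend on how many rows each ID has, so train_size and test_size are only approximate in terms of rows.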

Answer 2

Fortunately, train_test_split has a stratify parameter. If you set it to X['Unique ID'], a unique ID cannot end up in both the training and the testing set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=X['Unique ID'].values)
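Whether a split really keeps IDs apart is easy to check directly. In scikit-learn, stratify treats the given array as class labels and preserves their proportions in both subsets, so it tends to place every ID on both sides rather than on one. The sketch below illustrates this on made-up data; the three IDs, the four rows per ID, and test_size=0.25 are assumptions chosen so that stratification is feasible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: three IDs with four rows each.
ids = np.repeat(["001", "002", "003"], 4)
X = np.arange(len(ids)).reshape(-1, 1)

# Stratify on the IDs, passing them through so the split labels can be inspected.
X_train, X_test, ids_train, ids_test = train_test_split(
    X, ids, test_size=0.25, stratify=ids, random_state=0
)

# IDs that occur in both the training and the test part.
shared = sorted(set(ids_train.tolist()) & set(ids_test.tolist()))
print("IDs in both sets:", shared)
```

If the printed list is not empty, the split leaks IDs between training and testing, and a group-aware splitter such as GroupShuffleSplit is the safer choice.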

2 answers

Answer 1: score 2, posted 2020-05-15 19:22:13

Answer 2: score -1, posted 2020-05-15 19:25:44
