Train Test Split sklearn based on group variable
My X is as follows:
Unique ID   Exp start date   Value   Status
001         01/01/2020       4000    Closed
001         12/01/2019       4000    Archived
002         01/01/2020       5000    Closed
002         12/01/2019       5000    Archived
I want to make sure that none of the unique IDs that appear in the training set are also included in the test set. I am using sklearn's train_test_split. Is this possible?
I believe you need GroupShuffleSplit (see its entry in the scikit-learn documentation).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
# One group label per row; rows with the same label stay together
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])
print(groups.shape)

gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)
for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
As the output above shows, the train/test indices are created based on the groups variable: all rows of a group land on the same side of the split. In your case, Unique ID should be used as groups.
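Applied to a table like the one in the question, this could look roughly as follows. This is only a sketch: the two extra IDs (003, 004), their values, and the choice of Status as the target column are assumptions made for illustration and are not taken from the question.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame modeled on the question's table; IDs 003/004 and their
# values are made up so that there are enough groups to split.
df = pd.DataFrame({
    "Unique ID": ["001", "001", "002", "002", "003", "003", "004", "004"],
    "Exp start date": ["01/01/2020", "12/01/2019"] * 4,
    "Value": [4000, 4000, 5000, 5000, 6000, 6000, 7000, 7000],
    "Status": ["Closed", "Archived"] * 4,
})

# Assumption: Status is the target; the question does not say what y is.
X = df.drop(columns=["Status"])
y = df["Status"]

# test_size is a fraction of the groups (IDs), not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=df["Unique ID"]))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# IDs present on both sides of the split; expected to be empty.
shared_ids = set(X_train["Unique ID"]) & set(X_test["Unique ID"])
print(shared_ids)
```

Note that with GroupShuffleSplit the train/test proportions refer to groups, so the row counts of the two sets can differ from the requested fraction when groups have different sizes.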
train_test_split also has a stratify parameter, but it does not solve this problem. If you set it to X['Unique ID'], the split keeps the proportion of each ID roughly the same in both sets, so an ID ends up in training and in testing, which is the opposite of what you want. A call like the following therefore should not be used for a group-wise split (with only a couple of rows per ID and a small test_size it may also raise an error):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=df['Unique ID'].values)
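Whether a given split really keeps IDs apart can be checked directly by intersecting the ID sets of the two halves. The sketch below runs that check on toy data for a stratified train_test_split and for GroupShuffleSplit; the IDs, array sizes, and variable names are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Hypothetical data: 6 IDs with 4 rows each.
ids = np.repeat(np.arange(6), 4)
X = np.arange(len(ids) * 2).reshape(len(ids), 2)
y = np.zeros(len(ids))

# Stratified split on the ID column.
ids_train, ids_test = train_test_split(
    ids, test_size=0.25, stratify=ids, random_state=0
)
overlap_stratified = set(ids_train.tolist()) & set(ids_test.tolist())

# Group-wise split on the same ID column.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=ids))
overlap_grouped = set(ids[train_idx].tolist()) & set(ids[test_idx].tolist())

print("IDs shared after stratified split:", sorted(overlap_stratified))
print("IDs shared after grouped split:", sorted(overlap_grouped))
```

On this toy data the stratified split places every ID on both sides, while the grouped split shares none.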