Creating train/test/val split with StratifiedKFold

I'm trying to use StratifiedKFold to create train/test/val splits for use in a non-sklearn machine learning workflow. So the DataFrame needs to be split once and then stay that way.

I'm trying to do it like the following, using .values because I'm passing pandas DataFrames:

skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)

for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]

This fails with:

ValueError: not enough values to unpack (expected 3, got 2).

I read through all of the sklearn docs and ran the example code, but did not gain a better understanding of how to use stratified k-fold splits outside of a sklearn cross-validation scenario.

EDIT:

I also tried it like this:

# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)

This seems to work, although I imagine I'm messing with the stratification by doing so, since the second split is not stratified.

StratifiedKFold can only split your dataset into two parts per fold. You are getting the error because the split() method only yields a tuple of train_index and test_index (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94 ).
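
To make the two-part behaviour concrete, here is a minimal sketch on toy data (the arrays below are made up for illustration and are not from the question). Each item yielded by split() unpacks into exactly two index arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data, assumed for illustration: 12 samples, two balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3, shuffle=False)

# Materialise the generator so the folds can be inspected.
folds = list(skf.split(X, y))

for train_index, test_index in folds:
    # Only two arrays per fold; asking for a third raises the ValueError.
    print("TRAIN:", train_index, "TEST:", test_index)
```

Unpacking into three names, as in the question, fails on the first iteration because there is no third array to assign.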

For this use case you should first split your data into a validation set and the rest, and then split the rest again into test and train sets, like this:

X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
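
As a rough sanity check of this two-stage split, the sketch below runs it on synthetic, imbalanced labels (the data, the 80/20 class ratio and the random_state values are assumptions for illustration) and prints the size and positive-class share of each part. Taking 20% first and then 25% of the remaining 80% gives a 60/20/20 train/test/val split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, assumed for illustration: 80% class 0, 20% class 1.
rng = np.random.default_rng(0)
X = rng.random((1000, 3))
y = np.array([0] * 800 + [1] * 200)

# Stage 1: hold out 20% as the validation set, stratified on y.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Stage 2: 25% of the remaining 80% becomes the test set (20% of the total).
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    # Size of each part and its share of class 1.
    print(name, len(part), round(float(part.mean()), 3))
```

Because both calls pass stratify, each part should keep roughly the class ratio of the full dataset.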

I'm not exactly sure if this question is about KFold or just stratified splits, but I wrote this quick wrapper for StratifiedKFold that also yields a cross-validation set.

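import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):
    """StratifiedKFold that yields (train, cv, test) indices for each fold."""

    def split(self, X, y, groups=None):
        # Convert to an array so positional indexing also works for a pandas Series.
        y = np.asarray(y)
        s = super().split(X, y, groups)
        for train_indxs, test_indxs in s:
            y_train = y[train_indxs]
            # Carve a stratified cv set, the same size as one fold, out of the training indices.
            train_indxs, cv_indxs = train_test_split(
                train_indxs, stratify=y_train, test_size=(1 / (self.n_splits - 1))
            )
            yield train_indxs, cv_indxs, test_indxs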

It can be used like this:

import numpy as np

X = np.random.rand(100)
y = np.random.choice([0,1],100)
g = StratifiedKFold3(10).split(X,y)
train, cv, test = next(g)
train.shape, cv.shape, test.shape
>> ((80,), (10,), (10,))

Pass the target to the stratify parameter. In the first split, pass the complete target array (y in my case). Then, in the next split, pass the target that came out of the first split (y_train in my case):

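from sklearn.model_selection import train_test_split

# Features are all columns but the last; the target is the last column.
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

# First split: hold out 20% as the test set, stratified on the full target.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: take 20% of the remaining training data for validation, stratified on y_train.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)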
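
Since the question wants the DataFrame itself split once and kept that way, the same two-step idea can also be applied to the DataFrame directly, without .values. This is only a sketch: the DataFrame and the column name "target" are hypothetical, and train_test_split returns DataFrames when given one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame, assumed for illustration: two features and a "target" label column.
df = pd.DataFrame({
    "f1": np.arange(100),
    "f2": np.arange(100) * 2,
    "target": [0] * 70 + [1] * 30,
})

# First split: 20% test, stratified on the label column.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["target"]
)
# Second split: 20% of the remaining rows for validation, stratified again.
train_df, val_df = train_test_split(
    train_df, test_size=0.2, random_state=42, stratify=train_df["target"]
)

print(len(train_df), len(val_df), len(test_df))
```

The three frames keep their original row index, so they can be written to disk or handed to a non-sklearn pipeline and traced back to the source rows later.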
