Using StratifiedKFold to create train/test/validation splits

Question

I'm trying to use StratifiedKFold to create train/test/val splits for use in a non-sklearn machine learning workflow. So, the DataFrame needs to be split and then stay that way.

I'm trying to do it like the following, using .values because I'm passing pandas DataFrames:

skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)

for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]

This fails with:

ValueError: not enough values to unpack (expected 3, got 2).

I read through all of the sklearn docs and ran the example code, but did not gain a better understanding of how to use stratified k-fold splits outside of a sklearn cross-validation scenario.

EDIT:

I also tried it like this:

# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)

This seems to work, although I imagine I'm messing with the stratification by doing so.

Answer 1

StratifiedKFold can only split your dataset into two parts per fold. You are getting an error because the split() method yields only a tuple of train_index and test_index (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94 ).

For this use case you should first split your data into validation and rest, and then split the rest again into test and train, like so:
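
To illustrate the point about split(), here is a minimal sketch on toy data (the arrays below are made up for the example and are not from the question). Each item produced by split() is a pair of index arrays, which is why unpacking into three names raises the ValueError.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 12 samples, two balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3, shuffle=False)
folds = list(skf.split(X, y))

# Every fold is a (train_index, test_index) pair; there is no third array.
for train_index, test_index in folds:
    print("TRAIN:", train_index, "TEST:", test_index)
```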

from sklearn.model_selection import train_test_split
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
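
As a rough check of what the two-step split produces, the sketch below applies the same pattern to made-up data (the toy arrays and the random_state values are assumptions added for reproducibility). Taking 20% first and then 25% of the remaining 80% should give roughly a 60/20/20 train/test/validation split, with the class ratio preserved in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 70% class 0 and 30% class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# Step 1: hold out 20% of all samples as the validation set.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Step 2: 25% of the remaining 80% is 20% of the original data.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Print the size and class proportions of each part.
for name, part in [("train", y_train), ("test", y_test), ("val", y_val)]:
    print(name, len(part), np.bincount(part) / len(part))
```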

Answer 2

I'm not exactly sure if this question is about KFold or just stratified splits, but I wrote this quick wrapper for StratifiedKFold that adds a cross-validation set.

from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):

    def split(self, X, y, groups=None):
        s = super().split(X, y, groups)
        for train_indxs, test_indxs in s:
            y_train = y[train_indxs]
            train_indxs, cv_indxs = train_test_split(train_indxs,stratify=y_train, test_size=(1 / (self.n_splits - 1)))
            yield train_indxs, cv_indxs, test_indxs

It can be used like this:

import numpy as np
X = np.random.rand(100)
y = np.random.choice([0,1],100)
g = StratifiedKFold3(10).split(X,y)
train, cv, test = next(g)
train.shape, cv.shape, test.shape
>> ((80,), (10,), (10,))
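
One way to sanity-check a wrapper like this is to confirm that, in every fold, the three index arrays are disjoint and together cover all samples. The sketch below repeats the class so it runs on its own; the balanced toy labels and the choice of 5 folds are assumptions for the example, and y is assumed to be a NumPy array so that y[train_idx] is positional indexing.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):
    """Same idea as the wrapper above, repeated to keep the sketch standalone."""

    def split(self, X, y, groups=None):
        for train_idx, test_idx in super().split(X, y, groups):
            # Carve a stratified cross-validation set out of the training indices.
            train_idx, cv_idx = train_test_split(
                train_idx, stratify=y[train_idx],
                test_size=1 / (self.n_splits - 1))
            yield train_idx, cv_idx, test_idx

# Hypothetical toy data: 100 samples with balanced binary labels.
X = np.random.default_rng(0).random(100)
y = np.array([0, 1] * 50)

for train, cv, test in StratifiedKFold3(5).split(X, y):
    combined = np.concatenate([train, cv, test])
    # No index appears twice, and every sample is assigned to exactly one set.
    assert len(combined) == len(X)
    assert len(np.unique(combined)) == len(X)
```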

Answer 3

In the stratify parameter, pass the target to stratify on. First, pass the complete target array (y in my case). Then, in the next split, pass the target that came out of the first split (y_train in my case):

X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
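
Since the question asks for the DataFrame itself to be split and kept that way, here is a sketch of the same two-step approach applied directly to a DataFrame rather than to .values. The frame and its column names (f1, f2, label) are hypothetical; train_test_split accepts DataFrames and returns DataFrame slices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: two feature columns and a label column.
df = pd.DataFrame({
    "f1": np.arange(100),
    "f2": np.arange(100) * 0.5,
    "label": [0] * 80 + [1] * 20,
})

# First split off the test set, stratifying on the full label column.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"])
# Then split validation off the remaining rows, stratifying on their labels.
train_df, val_df = train_test_split(
    train_df, test_size=0.2, random_state=42, stratify=train_df["label"])

print(len(train_df), len(val_df), len(test_df))
```

The three resulting frames keep their original index values, so they can be saved or passed on to a non-sklearn workflow as they are.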

3 answers

Answer 1
2 2017-07-20 18:48:21

Answer 2
1 2020-06-02 14:35:07

Answer 3
0 2019-06-04 17:25:20
