Creating train/test/val split with StratifiedKFold
I'm trying to use StratifiedKFold to create train/test/val splits for use in a non-sklearn machine learning workflow. So, the DataFrame needs to be split and then stay that way.
I'm trying to do it like the following, using .values because I'm passing pandas DataFrames:
skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)
for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index, "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]
This fails with:
ValueError: not enough values to unpack (expected 3, got 2).
I read through all of the sklearn docs and ran the example code, but did not gain a better understanding of how to use stratified k-fold splits outside of a sklearn cross-validation scenario.
EDIT:
I also tried it like this:
# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)
Which seems to work, although I imagine I'm messing with the stratification by doing so.
StratifiedKFold can only be used to split your dataset into two parts per fold. You are getting an error because the split() method will only yield a tuple of train_index and test_index (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94 ).
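To see the two-part behaviour directly, here is a minimal sketch on made-up toy data (the arrays below are assumptions for illustration, not the asker's DataFrame). Each item yielded by split() has exactly two index arrays, which is why unpacking into three names fails:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (made up for illustration): 12 samples, two balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3, shuffle=False)
folds = list(skf.split(X, y))

for fold in folds:
    # Each fold is a 2-tuple of integer index arrays: (train, test).
    train_index, test_index = fold
    print(len(fold), len(train_index), len(test_index))
```

Across the three folds, every sample appears in a test part exactly once.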
For this use case you should first split your data into validation and rest, and then split the rest again into test and train, like this:
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
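As a rough check of what the two calls above produce, the sketch below runs them on a made-up DataFrame (the column names, the 70/30 class balance and random_state=0 are assumptions for illustration). It passes the pandas objects directly, without .values, so the pieces stay DataFrames/Series, and it prints the size and class-1 share of each piece:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 70% class 0 and 30% class 1.
df = pd.DataFrame({"feature": np.arange(100), "label": [0] * 70 + [1] * 30})
X, y = df[["feature"]], df["label"]

# First carve off 20% for validation, stratified on the full target.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Then take 25% of the remaining 80% (20% of the total) as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    # Size of the piece and the fraction of rows with label 1.
    print(name, len(part), round(part.mean(), 2))
```

On this toy data the result is a 60/20/20 split, and because both calls stratify, each piece keeps the 30% class-1 share.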
I'm not exactly sure if this question is about KFold or just stratified splits, but I wrote this quick wrapper for StratifiedKFold with a cross validation set.
from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):
    def split(self, X, y, groups=None):
        s = super().split(X, y, groups)
        for train_indxs, test_indxs in s:
            # Labels of the training part, used to stratify the second split.
            y_train = y[train_indxs]
            # Carve a cross-validation set the size of one fold out of the training indices.
            train_indxs, cv_indxs = train_test_split(train_indxs, stratify=y_train, test_size=(1 / (self.n_splits - 1)))
            yield train_indxs, cv_indxs, test_indxs
It can be used like this:
import numpy as np

X = np.random.rand(100)
y = np.random.choice([0, 1], 100)
g = StratifiedKFold3(10).split(X, y)
train, cv, test = next(g)
train.shape, cv.shape, test.shape
>> ((80,), (10,), (10,))
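A quick sanity check for a wrapper like this is that the three index sets in every fold are disjoint and together cover the whole dataset. The sketch below repeats the wrapper so it runs on its own and uses made-up balanced labels (an assumption, chosen so the fold sizes are predictable):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

class StratifiedKFold3(StratifiedKFold):
    # Same wrapper as above, repeated so this snippet is self-contained.
    def split(self, X, y, groups=None):
        for train_indxs, test_indxs in super().split(X, y, groups):
            train_indxs, cv_indxs = train_test_split(
                train_indxs, stratify=y[train_indxs],
                test_size=1 / (self.n_splits - 1))
            yield train_indxs, cv_indxs, test_indxs

rng = np.random.default_rng(0)
X = rng.random(100)
y = np.array([0, 1] * 50)  # toy labels: exactly 50 per class

sizes = []
for train, cv, test in StratifiedKFold3(10).split(X, y):
    combined = np.concatenate([train, cv, test])
    # No index may appear twice, and none may be missing.
    assert len(np.unique(combined)) == len(X)
    sizes.append((len(train), len(cv), len(test)))
```

Note that y[train_indxs] assumes y is a NumPy array; with a pandas Series, convert it first (for example with y.values, as in the question).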
In the stratify parameter, pass the target to stratify on. In the first split, pass the complete target array (y in my case). Then, in the next split, pass the target that came out of the first split (y_train in my case):
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
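Keep in mind that the second test_size is a fraction of what is left after the first split, not of the full dataset. A small helper (hypothetical, written only for this illustration) makes the resulting shares explicit:

```python
def split_fractions(first_test_size, second_test_size):
    """Return the (train, val, test) shares of the full dataset for two chained splits."""
    test = first_test_size
    # The second split only sees the part that was not held out as test.
    val = (1 - first_test_size) * second_test_size
    train = (1 - first_test_size) * (1 - second_test_size)
    return train, val, test

train, val, test = split_fractions(0.2, 0.2)
print(f"train={train:.2f} val={val:.2f} test={test:.2f}")
```

So with test_size=0.2 in both calls, the data ends up roughly 64% train, 16% validation and 20% test.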