How can I split my dataset into training and validation sets without creating a test set?
I know the split below is wrong for splitting into training and validation sets, but it should make clear what I actually need. I want to use only a training set and a validation set; I don't need any test set.
# Data split
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.976, random_state=0)
The test set is the validation set:

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

Here x_test and y_test are your validation set or test set; they are the same thing. They are a small slice of the total x, y samples, used to validate your model on data it hasn't been trained on.
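To make this concrete, here is a minimal sketch of a train/validation-only split; the toy arrays and the 20% validation fraction are illustrative choices, not from the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(100).reshape(50, 2)  # toy feature matrix: 50 samples, 2 features
y = np.arange(50)                  # toy targets

# 80% training, 20% validation; no separate test set is created.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_val.shape)  # (40, 2) (10, 2)
```

Whether you call the held-out 20% "validation" or "test" is just terminology; it is one slice of data the model never trains on.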
By using random_state you get reproducible results. In other words, you get the same sets each time you run the script.
The terms validation set and test set are sometimes used interchangeably, and sometimes mean slightly different things. @Sy Ker's point is correct: the sklearn function you're using does provide you with a validation set, though the term used in the module is test. Effectively, what you're doing is getting data for training and data for evaluation, regardless of the term used. I'm adding this answer because you might, in fact, need a form of test set.
Using train_test_split will give you a pair of sets that allow you to train a model (with the held-out proportion specified by the test_size argument, which should generally be something like 10-25% to ensure a representative subsample). But I would suggest thinking about the process a little more broadly.
Splitting data for testing and model evaluation can be done simply (and, likely, incorrectly) by just taking some y% of the rows from the bottom of a dataset. If normalization/standardization is being applied, make sure to fit it on the training set and then apply it to the evaluation set, so that the same treatment is applied to both without leaking statistics from the held-out data.
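A minimal sketch of that fit-on-train, apply-to-both pattern, using StandardScaler on synthetic data (the arrays and split fraction are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y = np.arange(100)

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Learn the mean/std from the training data ONLY...
scaler = StandardScaler().fit(x_train)

# ...then apply those same statistics to both sets.
x_train_s = scaler.transform(x_train)
x_val_s = scaler.transform(x_val)
```

Fitting the scaler on the full dataset (or on the evaluation set) would leak information about the held-out data into preprocessing, making evaluation scores optimistic.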
sklearn and other libraries have also made it very simple to do cross-validation, and in that case "validation" sets should be thought of a little differently. Cross-validation takes a portion of your data and subdivides it into smaller groups for repeated train-and-test passes. In that case, you might start with a split like the one from train_test_split and keep the "test" set as a total holdout, meaning that the cross-validation procedure never uses (or "sees") that data during its train/test process.
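A sketch of that workflow: hold out a test set first, then run cross-validation only on the training portion. The iris dataset, logistic regression model, and 5-fold choice are illustrative assumptions, not from the answer:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

x, y = load_iris(return_X_y=True)

# Carve off a final holdout set that cross-validation never sees.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training portion only.
scores = cross_val_score(model, x_train, y_train, cv=5)
print(scores.mean())
```

After model selection, you would fit the chosen model on the full training portion and score it once against the untouched x_test, y_test holdout.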
The test set you got from the train_test_split process can then serve as a good set of data for testing how the model performs on data it has never seen. You might see this referred to as a "holdout" set, or again as some version of "test" and/or "validation".
This link has a quick but intuitive description of cross-validation and holdout sets.