How can I split my dataset into training and validation sets without creating a test set?

I know this is the wrong way to split training and validation sets, but it shows what I need. I want to use only a training set and a validation set. I don't need a test set.

# Data split
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.976, random_state=0)

The test set is the validation set:

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

x_test and y_test are your validation set, or test set. They are the same thing. They hold a slice of the total x, y samples (25% by default) that is used to validate your model on data it hasn't been trained on.
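As a minimal sketch of the same call with the outputs named for validation: the toy arrays and the 0.2 fraction below are assumptions chosen for illustration, not values from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# test_size is the fraction of rows that goes into the second
# (validation) pair; the rest becomes the training pair.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_val.shape)  # (80, 3) (20, 3)
```

Only the variable names change; train_test_split itself has no separate notion of "validation" versus "test".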

By using random_state you get reproducible results. In other words, you get the same sets each time you run the script.
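A quick way to check this yourself, sketched with small made-up arrays: calling the function twice with the same random_state should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed.
a_train, a_val, _, _ = train_test_split(x, y, test_size=0.3, random_state=0)
b_train, b_val, _, _ = train_test_split(x, y, test_size=0.3, random_state=0)

# Both the training rows and the validation rows match exactly.
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_val, b_val)
print(same_split)  # True
```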

The terms validation set and test set are sometimes used interchangeably, and sometimes to mean slightly different things. @Sy Ker's point is correct: the sklearn module you're using does provide you with a validation set, though the term used in the module is test. Effectively, what you're doing is getting data for training and data for evaluation, regardless of the term used. I'm adding this answer to point out that you might, in fact, need a form of test set.

Using train_test_split will give you a pair of sets that allow you to train a model and evaluate it (with the held-out proportion specified in the test_size argument, which generally should be something like 10-25% to ensure that it's a representative subsample). But I would suggest thinking of the process a little more broadly.

Splitting data for use in testing and model evaluation can be done simply (and, likely, incorrectly) by just using some y% of the rows from the bottom of a dataset. If normalization/standardization is being done, make sure it is fit on the training set and then applied to the evaluation set, so that the same treatment is applied to both.
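A minimal sketch of that order of operations, assuming StandardScaler as the standardization step (the answer does not name a specific scaler) and made-up data. Note that train_test_split shuffles rows by default, which avoids the "bottom of the dataset" problem.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y = rng.integers(0, 2, size=200)

# Rows are shuffled before splitting (shuffle=True is the default).
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
# Learn the per-feature mean and standard deviation from training rows only.
x_train_scaled = scaler.fit_transform(x_train)
# Reuse those training statistics on the validation rows; do not refit.
x_val_scaled = scaler.transform(x_val)
```

Fitting the scaler on the full dataset before splitting would let information from the validation rows leak into training.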

sklearn and other libraries have also made it possible to do cross-validation very simply, and in this case "validation" sets should be thought of a little differently. Cross-validation will take a portion of your data and subdivide it into smaller groups for repeated testing-and-training passes. In this case, you might start with a split of data like that from train_test_split, and keep the "test" set as a total holdout, meaning that the cross-validation procedure never uses (or "sees") that data during its test/train process.
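One way this could look, as a sketch only: the LogisticRegression model, the 5 folds, and the synthetic data are all assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical toy data with a simple linear decision rule.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# Set aside a holdout that cross-validation never sees.
x_dev, x_holdout, y_dev, y_holdout = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation on the remaining rows only: each fold
# takes a turn as the validation set while the others train the model.
model = LogisticRegression()
scores = cross_val_score(model, x_dev, y_dev, cv=5)

print(scores.mean())
```

x_holdout and y_holdout are deliberately left unused here; they are reserved for a final check once model choices are settled.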

The test set that you got from the train_test_split process can then serve as a good set of data for testing how the model performs against data it has never seen. You might see this referred to as a "holdout" set, or again as some version of "test" and/or "validation".
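If you want a training set, a validation set, and a holdout at the same time, one common approach is to call train_test_split twice. The sketch below assumes a 60/20/20 split, a LogisticRegression model, and synthetic data, none of which come from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a simple linear decision rule.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# First call: reserve 20% of all rows as the holdout.
x_rest, x_holdout, y_rest, y_holdout = train_test_split(
    x, y, test_size=0.2, random_state=0
)
# Second call: 25% of the remaining 80% gives a 20% validation set.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(x_train, y_train)
val_accuracy = model.score(x_val, y_val)              # used while tuning
holdout_accuracy = model.score(x_holdout, y_holdout)  # checked once, at the end

print(len(x_train), len(x_val), len(x_holdout))  # 120 40 40
```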

This link has a quick but intuitive description of cross-validation and holdout sets.


