
How can I split my dataset into training and validation sets, without creating a test set?

I know this is the wrong way to split into training and validation sets, but you can see here what I really need. I want to use only a training set and a validation set; I don't need any test set.

# Data split
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.976, random_state=0)

The test set is the validation set:

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

x_test and y_test are your validation set (or test set); here they are the same thing. They are a small slice of the total x, y samples, used to validate your model on data it hasn't been trained on.

By using random_state you get reproducible results; in other words, you get the same sets each time you run the script.
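As a minimal sketch of the above (using a small synthetic dataset for illustration), a train/validation split with a fixed random_state looks like this; note that when test_size is not given, train_test_split defaults to holding out 25% of the data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 samples, 4 features.
x = np.arange(400).reshape(100, 4)
y = np.arange(100)

# 75% / 25% train/validation split (test_size defaults to 0.25).
x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=0)
print(x_train.shape, x_val.shape)  # (75, 4) (25, 4)

# The same random_state always yields the same split.
x_train2, _, _, _ = train_test_split(x, y, random_state=0)
assert (x_train == x_train2).all()
```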

The terms validation set and test set are sometimes used interchangeably, and sometimes mean slightly different things. @Sy Ker's point is correct: the sklearn function you're using does provide you with a validation set, even though the term used in the module is test. Effectively, you're getting data for training and data for evaluation, regardless of the term used. I'm adding this answer to note that you might, in fact, need a form of test set.

Using train_test_split will give you a pair of sets that allow you to train a model (with the held-out proportion specified via the test_size argument -- which, generally, should be something like 10-25% to ensure it's a representative subsample). But I would suggest thinking of the process a little more broadly.

Splitting data for testing and model evaluation can be done simply (and, likely, incorrectly) by just taking some y% of the rows from the bottom of a dataset. If normalization/standardization is being done, make sure to fit it on the training set and then apply it to the evaluation set, so that the same treatment is applied to both without leaking information from the evaluation data.
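A minimal sketch of that fit-on-train-only pattern, assuming sklearn's StandardScaler as the normalization step (random data used purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data for illustration only.
rng = np.random.RandomState(0)
x = rng.rand(100, 3)
y = rng.randint(0, 2, size=100)

x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=0)

# Fit the scaler on the training set only...
scaler = StandardScaler().fit(x_train)

# ...then apply that same fitted transformation to both sets,
# so no statistics from the validation data leak into training.
x_train_scaled = scaler.transform(x_train)
x_val_scaled = scaler.transform(x_val)
```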

sklearn and others have also made it possible to do cross-validation very simply, and in this case "validation" sets should be thought of a little differently. Cross-validation takes a portion of your data and subdivides it into smaller groups for repeated train-and-test passes. In this case, you might start with a split like the one from train_test_split and keep the "test" set as a total holdout -- meaning that the cross-validation procedure never uses (or "sees") that data during its train/test passes.
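A sketch of that holdout-plus-cross-validation workflow, assuming sklearn's cross_val_score, a LogisticRegression model, and the built-in iris dataset as stand-ins for your own model and data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

x, y = load_iris(return_X_y=True)

# Hold out 20% of the data; cross-validation never sees it.
x_train, x_holdout, y_train, y_holdout = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y)

# 5-fold cross-validation on the remaining 80% only.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, x_train, y_train, cv=5)
print("CV accuracy:", scores.mean())

# Final evaluation against the untouched holdout set.
model.fit(x_train, y_train)
print("Holdout accuracy:", model.score(x_holdout, y_holdout))
```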

The test set you got from train_test_split can then serve as a good set of data for checking how the model performs on data it has never seen. You might see this referred to as a "holdout" set, or again as some version of "test" and/or "validation".

This link has a quick, but intuitive, description of cross-validation and holdout sets.
