
Difficulty understanding the sizes of the train, test and validation sets in sklearn

I have a dataset of around 10k rows and I am splitting it 80:20 with sklearn's train_test_split. However, I don't understand why the sizes of the resulting sets don't add up to the size of the original dataset. For example, df.shape gives (9538, 15) for my dataset. When I pass it to train_test_split like this:

from sklearn.model_selection import train_test_split
train, test =  train_test_split(df_fake,test_size=0.2, random_state=0)
train, val =  train_test_split(df_fake,test_size=0.25,random_state=0)
print('Train-',train.shape)
print('Val-',val.shape)
print('Test-',test.shape)

I get this output:

Train- (7153, 15)
Val- (2385, 15)
Test- (1908, 15)

So if I add the test set to the validation set I get 4293, and adding the train set to that gives 11446, whereas I only have about 9.5k rows. Am I doing something wrong?

Do you want something more like this?

train, test_val = train_test_split(df_fake,test_size=0.2, random_state=0)
test,  val      = train_test_split(test_val,test_size=0.5,random_state=0)

Now train, test and val hold roughly 80%, 10% and 10% of the data, respectively.
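
To check the sizes, here is a minimal sketch that runs the same two-step split on a stand-in array with the same shape as the data in the question (9538 rows, 15 columns). The array `data` is made up for illustration; a DataFrame such as df_fake should split into the same row counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for df_fake: same shape as in the question, contents are arbitrary.
data = np.zeros((9538, 15))

# First cut: 80% train, 20% held out.
train, test_val = train_test_split(data, test_size=0.2, random_state=0)
# Second cut: split the held-out 20% in half.
test, val = train_test_split(test_val, test_size=0.5, random_state=0)

print('Train-', train.shape)  # (7630, 15)
print('Test-', test.shape)    # (954, 15)
print('Val-', val.shape)      # (954, 15)
print('Total-', len(train) + len(test) + len(val))  # 9538
```

The three parts now add up to the original 9538 rows, because the second call only splits what the first call held out.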

Your issue is that you assign the train variable twice, so the second call overwrites the result of the first. sklearn's train_test_split splits the data into two parts, and both of your calls split the full dataset. That is why train + val adds up to the correct number (7153 + 2385 = 9538), while test comes from a separate split and its rows are counted a second time. To split three ways, split once and then split one of the resulting parts again. For an 80-10-10 split, first cut out 20% and then cut that in half:

from sklearn.model_selection import train_test_split
train, tv = train_test_split(df_fake, test_size=0.2, random_state=0)
test, val = train_test_split(tv, test_size=0.5, random_state=0)
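
To see the double counting in the question's code, here is a small sketch that repeats the two independent splits on an array of row ids. The `rows` array is a hypothetical stand-in for the real data, used only so that rows can be compared between the sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the question's data: one id per row.
rows = np.arange(9538)

# The question's code: two independent splits of the SAME full dataset.
train, test = train_test_split(rows, test_size=0.2, random_state=0)
train, val = train_test_split(rows, test_size=0.25, random_state=0)  # overwrites train

# train and val come from the same call, so together they cover every row once.
print(len(train) + len(val))  # 9538

# test came from the other call, so each of its rows also sits in train or val.
shared_with_val = np.intersect1d(test, val).size
shared_with_train = np.intersect1d(test, train).size
print(shared_with_val + shared_with_train)  # 1908, i.e. all of test is counted twice
```

That extra 1908 is exactly the gap between 11446 and 9538.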
