Assigning variables returned by Sklearn train_test_split()

Question

I was confused about using train_test_split() in sklearn. Here is a code snippet of something I've tried:

X = example_df.drop('features', axis=1)
y = example_df['price']

y_test, X_train, X_test, y_train = train_test_split(X, y, test_size=0.2)

How does it split the rows of example_df? example_df has 100 rows, so I expected the datasets to be split with the following sizes:

y_test should have 80 rows
X_train should have 20 rows
X_test should have 80 rows
y_train should have 20 rows

But the sizes of my datasets were, respectively: 20, 20, 80, 20.

Why is this?

Answer 1

In this case, the correct order of variable assignment is X_train, X_test, y_train, y_test, i.e. you need to rewrite your code as:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
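As a quick check of the sizes, here is a minimal sketch that uses NumPy arrays with 100 rows as a stand-in for example_df. The array contents, the three feature columns, and random_state=0 are assumptions made for illustration and are not part of the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for example_df: 100 rows, 3 feature columns.
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# train_test_split returns, in order: X train, X test, y train, y test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sizes = {
    "X_train": len(X_train),
    "X_test": len(X_test),
    "y_train": len(y_train),
    "y_test": len(y_test),
}
print(sizes)  # {'X_train': 80, 'X_test': 20, 'y_train': 80, 'y_test': 20}
```

With the ordering from the question, the name y_test would instead be bound to the first returned piece, which is the 80-row training portion of X.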

Multiple assignment unpacking

Additionally, I suspect your confusion might also come from a misunderstanding of how multiple assignment unpacking works in Python, and of what train_test_split(...) returns.

Let us consider giving train_test_split(...) a single array to split, e.g. y = [0, 1, 2, 3, 4]. train_test_split(y) gives us an output that will look something like [[0, 1, 2], [3, 4]] (plus or minus some shuffling). We see that it takes our original list and returns a list of 2 lists.

We can pass train_test_split(...) an arbitrary number of lists to split. So let's see what happens if we give train_test_split(...) 2 lists (list_1, list_2) as input. It would return a list of 4 lists. The first two inner lists would be the training set of list_1 followed by the testing set of list_1, and the last two inner lists would be the training set of list_2 followed by the testing set of list_2. The returned lists, however, do not correspond to any keywords such as "X_train" or "X_test"; they're just good old regular lists, identified only by their position in the output.
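To make the shape of the return value concrete, here is a small sketch for both the single-input and the two-input case. The list contents and random_state=0 are assumptions chosen for illustration; the sizes rely on the default test fraction of 0.25 that applies when no size is given.

```python
from sklearn.model_selection import train_test_split

# Single input: the result is a list of 2 pieces (train, test).
y = [0, 1, 2, 3, 4]
single = train_test_split(y, random_state=0)
single_sizes = [len(part) for part in single]

# Two inputs: the result is a list of 4 pieces, ordered
# list_1 train, list_1 test, list_2 train, list_2 test.
list_1 = [0, 1, 2, 3, 4, 5, 6, 7]
list_2 = ["a", "b", "c", "d", "e", "f", "g", "h"]
datasets = train_test_split(list_1, list_2, random_state=0)
double_sizes = [len(part) for part in datasets]

print(type(datasets).__name__, len(datasets))
print(single_sizes, double_sizes)
```

Both inputs are shuffled with the same row selection, so the i-th element of the list_1 training piece still pairs with the i-th element of the list_2 training piece.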

One way to handle the output would be like this:

datasets = train_test_split(list_1, list_2)
list_1_train = datasets[0]
list_1_test = datasets[1]
list_2_train = datasets[2]
list_2_test = datasets[3]

However, this is lengthy, repetitive, and prone to bugs. Thankfully, Python gives us syntax to unpack multiple values and assign them in a single statement. The equivalent of assigning the four lists as shown in the above code snippet would be to do this:

[list_1_train, list_1_test, list_2_train, list_2_test] = train_test_split(list_1, list_2)

or, with more sugar:

list_1_train, list_1_test, list_2_train, list_2_test = train_test_split(list_1, list_2)
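The key point is that unpacking binds names purely by position, so the names on the left have no influence on which piece they receive. The sketch below illustrates this with fake_split, a hypothetical stand-in that is not part of sklearn and simply returns four lists in a fixed order without shuffling.

```python
def fake_split(a, b):
    """Hypothetical stand-in for train_test_split: 4 lists in a fixed order."""
    # The first 3 items act as the "train" part, the rest as the "test" part.
    return [a[:3], a[3:], b[:3], b[3:]]


list_1 = [0, 1, 2, 3, 4]
list_2 = ["a", "b", "c", "d", "e"]

# Names are bound by position, so this ordering matches the return order.
list_1_train, list_1_test, list_2_train, list_2_test = fake_split(list_1, list_2)

# Reordering the names does not reorder the data: the name "wrong_test"
# still receives the first returned piece, which is list_1's train part.
wrong_test, wrong_train, _, _ = fake_split(list_1, list_2)

print(list_1_train, list_1_test)  # [0, 1, 2] [3, 4]
print(wrong_test, wrong_train)    # [0, 1, 2] [3, 4]
```

This is the same thing that happens in the question: writing y_test first does not make it the y test set, it just gives that name to whatever is returned first.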
