I am confused about how train_test_split() in sklearn works. Here is a code snippet of what I tried:
X = example_df.drop('features', axis=1)
y = example_df['price']
y_test, X_train, X_test, y_train= train_test_split(X, y, test_size=0.2)
How does it split the rows of example_df? example_df has 100 rows, so I expected the datasets to be split with the following sizes:
y_test should have 80 rows
X_train should have 20 rows
X_test should have 80 rows
y_train should have 20 rows
But the sizes of my datasets were, respectively: 20, 20, 80, 20.
Why is this?
The names on the left-hand side of your assignment are in the wrong order. train_test_split(...) returns its results in the order X_train, X_test, y_train, y_test, i.e. you need to rewrite your code as:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
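As a quick check of the resulting sizes, here is a minimal sketch that assumes scikit-learn and NumPy are installed. The arrays are made-up stand-ins for the 100-row example_df, and random_state=0 is added only so the run is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 100 rows (not the real example_df).
X = np.arange(200).reshape(100, 2)  # 100 rows, 2 feature columns
y = np.arange(100)                  # 100 target values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Collect the row counts in the same order as the names above.
sizes = (len(X_train), len(X_test), len(y_train), len(y_test))
print(sizes)  # (80, 20, 80, 20)
```

With test_size=0.2, both "train" pieces get 80 rows and both "test" pieces get 20 rows.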
Multiple assignment unpacking
Additionally, I suspect your confusion might also come from a misunderstanding of how multiple assignment unpacking works in Python, and of what train_test_split(...) returns.
Let us consider giving train_test_split(...) a single array to split, e.g. y = [0, 1, 2, 3, 4]. train_test_split(y) gives us an output that will look something like [[0, 1, 2], [3, 4]] (plus or minus some shuffling). We see that it takes our original list and returns a list of 2 lists.
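The single-list case can be tried directly. This is a small sketch assuming scikit-learn is installed; shuffle=False is my addition so that the output is deterministic and easy to compare with the list above.

```python
from sklearn.model_selection import train_test_split

y = [0, 1, 2, 3, 4]

# shuffle=False keeps the original order so the split is easy to read.
# With no test_size given, scikit-learn holds out 25% (rounded up) for testing.
parts = train_test_split(y, shuffle=False)

print(len(parts))  # 2
print(parts)       # [[0, 1, 2], [3, 4]]
```

Without shuffle=False the same two pieces come back, but with the elements drawn in random order.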
We can pass train_test_split(...) an arbitrary number of lists to split. So let's see what happens if we give train_test_split(...) 2 lists (list_1, list_2) as input. It would return a list of 4 lists. The first two inner lists would be the training set of list_1 followed by the testing set of list_1, and the last two inner lists would be the training set of list_2 followed by the testing set of list_2. The returned lists, however, do not correspond to any keywords such as "X_train" or "X_test"; they're just good old regular lists, identified only by their position.
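To see that ordering concretely, here is a short sketch with two made-up lists, again assuming scikit-learn is installed and using shuffle=False (my choice) to keep the output predictable.

```python
from sklearn.model_selection import train_test_split

# Two illustrative lists of the same length.
list_1 = [0, 1, 2, 3, 4]
list_2 = ["a", "b", "c", "d", "e"]

datasets = train_test_split(list_1, list_2, shuffle=False)

print(len(datasets))  # 4
# Print each returned piece next to its position in the result.
for position, part in enumerate(datasets):
    print(position, part)
# 0 [0, 1, 2]
# 1 [3, 4]
# 2 ['a', 'b', 'c']
# 3 ['d', 'e']
```

Positions 0 and 1 belong to list_1, positions 2 and 3 belong to list_2, and nothing in the result carries a name.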
One way to handle the output would be like this
datasets = train_test_split(list_1, list_2)
list_1_train = datasets[0]
list_1_test = datasets[1]
list_2_train = datasets[2]
list_2_test = datasets[3]
However, this is lengthy, repetitive, and prone to bugs. Thankfully, Python gives us syntax to unpack multiple values and assign them in a single statement. The equivalent of assigning the four lists as shown in the above code snippet would be to do this:
[list_1_train, list_1_test, list_2_train, list_2_test] = train_test_split(list_1, list_2)
or, with more sugar:
list_1_train, list_1_test, list_2_train, list_2_test = train_test_split(list_1, list_2)
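Because unpacking is purely positional, the names on the left have no influence on what each variable receives. The sketch below uses fake_split, a hypothetical pure-Python stand-in that I made up to mimic only the return order of train_test_split (no shuffling, last 20% as the test part), together with the name order from the question.

```python
def fake_split(*sequences):
    """Hypothetical stand-in for train_test_split: returns, for each input,
    its first 80% followed by its last 20%, in input order."""
    parts = []
    for seq in sequences:
        cut = len(seq) - len(seq) // 5  # index where the test part starts
        parts.extend([seq[:cut], seq[cut:]])
    return parts

X = list(range(100))       # pretend features: 0..99
y = list(range(100, 200))  # pretend targets: 100..199

# Names in the order used in the question; values are assigned by position.
y_test, X_train, X_test, y_train = fake_split(X, y)

print(len(y_test), len(X_train), len(X_test), len(y_train))  # 80 20 80 20
print(y_test == X[:80])  # True: "y_test" actually holds the training part of X
```

Renaming the variables changes nothing about the data. Only putting the names in the same order as the returned pieces does.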