Assigning the variables returned by Sklearn train_test_split()

Question

I was confused about using train_test_split() in sklearn. Here is a code snippet of something I've tried:

X = example_df.drop('features', axis=1)
y = example_df['price']

y_test, X_train, X_test, y_train= train_test_split(X, y, test_size=0.2)

How does it split the rows of example_df? example_df has 100 rows, so I expected the datasets to be split with the following sizes.

y_test should have 80 rows
X_train should have 20 rows
X_test should have 80 rows
y_train should have 20 rows

But the sizes of my datasets were, respectively: 20, 20, 80, 20.

Why is this?

Answer 1

In this case, the correct order of variable assignment is X_train, X_test, y_train, y_test, i.e. you need to rewrite your code as:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
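To check the sizes this ordering produces, here is a minimal sketch. The DataFrame below is a hypothetical stand-in for example_df (the column names "area" and "rooms" are invented for illustration), and it drops the "price" column from X, which is an assumption about what the question intended.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for example_df: 100 rows, two feature columns, one target.
example_df = pd.DataFrame({
    "area": np.arange(100),
    "rooms": np.arange(100) % 5,
    "price": np.arange(100) * 1000.0,
})

X = example_df.drop("price", axis=1)  # features (assumed: everything except price)
y = example_df["price"]               # target

# Outputs come back in the order: X train, X test, y train, y test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sizes = (len(X_train), len(X_test), len(y_train), len(y_test))
print(sizes)  # (80, 20, 80, 20)
```

With test_size=0.2 and 100 rows, each test part gets 20 rows and each train part gets the remaining 80.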

Multiple assignment unpacking

Additionally, I suspect your confusion might also come from a misunderstanding of how multiple assignment unpacking works in Python, and of the return value of train_test_split(...).

Let us consider giving train_test_split(...) a single array to split, e.g. y = [0, 1, 2, 3, 4]. train_test_split(y) gives us an output that looks something like [[0, 1, 2], [3, 4]] (up to shuffling). It takes our original list and returns a list of 2 lists.
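The single-list case can be tried directly. This is a small sketch; random_state=0 is only there to make the shuffle repeatable, and the exact elements in each part depend on the shuffle, so only the shapes are checked.

```python
from sklearn.model_selection import train_test_split

y = [0, 1, 2, 3, 4]

# With no test_size or train_size given, the test share defaults to 0.25,
# which is rounded up to 2 of the 5 samples.
result = train_test_split(y, random_state=0)

print(type(result).__name__, len(result))  # list 2

y_train, y_test = result
print(len(y_train), len(y_test))  # 3 2
```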

We can pass train_test_split(...) an arbitrary number of lists to split. So let's see what happens if we give train_test_split(...) 2 lists (list_1, list_2) as input. It returns a list of 4 lists. The first two inner lists are the training set of list_1 followed by the testing set of list_1, and the last two inner lists are the training set of list_2 followed by the testing set of list_2. The returned lists, however, do not correspond to any keywords such as "X_train" or "X_test"; they are plain lists, identified only by their position in the output.
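A short sketch of the two-list case follows. The contents of list_1 and list_2 are made up for illustration; list_2 is built as ten times list_1 so that it is easy to see that both inputs are split along the same positions.

```python
from sklearn.model_selection import train_test_split

# Illustrative inputs: list_2[i] is always 10 * list_1[i].
list_1 = list(range(10))
list_2 = [x * 10 for x in list_1]

# test_size=3 asks for exactly 3 test samples, leaving 7 for training.
parts = train_test_split(list_1, list_2, test_size=3, random_state=0)

print(len(parts))                # 4
print([len(p) for p in parts])   # [7, 3, 7, 3]

# Order of the output: list_1 train, list_1 test, list_2 train, list_2 test.
# The same positions are selected from both inputs, so the pairing is preserved.
pairing_kept = (
    [x * 10 for x in parts[0]] == list(parts[2])
    and [x * 10 for x in parts[1]] == list(parts[3])
)
print(pairing_kept)  # True
```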

One way to handle the output would be like this:

datasets = train_test_split(list_1, list_2)
list_1_train = datasets[0]
list_1_test = datasets[1]
list_2_train = datasets[2]
list_2_test = datasets[3]

However, this is lengthy, repetitive, and prone to bugs. Thankfully, Python gives us syntax to unpack multiple values and assign them in a single statement. The equivalent of assigning the four lists as shown in the above code snippet is:

[list_1_train, list_1_test, list_2_train, list_2_test] = train_test_split(list_1, list_2)

or, with more sugar:

list_1_train, list_1_test, list_2_train, list_2_test = train_test_split(list_1, list_2)
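Unpacking binds names purely by position, which is why the order on the left-hand side matters. The sketch below uses a hypothetical helper, fake_split, as a stand-in for train_test_split(list_1, list_2) so the effect is visible without any shuffling.

```python
def fake_split():
    # Hypothetical stand-in for train_test_split(list_1, list_2):
    # returns four lists in a fixed order, each labeled with what it holds.
    return [["list_1 train"], ["list_1 test"], ["list_2 train"], ["list_2 test"]]

# Names in the matching order receive the matching pieces.
list_1_train, list_1_test, list_2_train, list_2_test = fake_split()
print(list_1_train)  # ['list_1 train']

# Names in a different order are still bound position by position,
# so each name now holds a piece its label does not describe.
list_2_test, list_1_train, list_1_test, list_2_train = fake_split()
print(list_2_test)   # ['list_1 train']
print(list_1_train)  # ['list_1 test']
```

Python does not inspect the variable names, so a name such as y_test placed first simply receives whatever the function returns first.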

Answer 1 (accepted), 2020-05-28 17:19:09