Assigning variables returned by Sklearn train_test_split()

I was confused about using train_test_split() in sklearn. Here is a code snippet of something I've tried:

X = example_df.drop('features', axis=1)
y = example_df['price']

y_test, X_train, X_test, y_train= train_test_split(X, y, test_size=0.2)

How does it split the rows of example_df? example_df has 100 rows, so I expected the datasets to be split with the following sizes:

  1. y_test should have 80 rows

  2. X_train should have 20 rows

  3. X_test should have 80 rows

  4. y_train should have 20 rows

But the sizes of my datasets were, respectively: 20, 20, 80, 20.

Why is this?

train_test_split() returns its outputs in a fixed order, and Python binds them to the names on the left-hand side by position, not by name. So in this case, the correct order of variable assignment is X_train, X_test, y_train, y_test, i.e. you need to rewrite your code as:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
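As a quick check of the corrected ordering, here is a minimal sketch. It assumes NumPy arrays as a hypothetical stand-in for example_df (100 rows, 3 made-up feature columns), and random_state=0 is added only to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for example_df: 100 rows, 3 feature columns.
# Row i of X is [3i, 3i+1, 3i+2] and its target is y[i] = i.
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# Outputs come back as: X train, X test, y train, y test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

sizes = [len(X_train), len(X_test), len(y_train), len(y_test)]
print(sizes)  # [80, 20, 80, 20]

# Rows of X and y stay aligned after the shuffle.
rows_aligned = bool((X_train[:, 0] // 3 == y_train).all())
print(rows_aligned)  # True
```

With test_size=0.2 and 100 rows, the two test sets get 20 rows each and the two training sets get 80 rows each.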

Multiple assignment unpacking

Additionally, I suspect your confusion might also come from a misunderstanding of how multiple assignment unpacking works in Python, and of what train_test_split(...) returns.

Let us consider giving train_test_split(...) a single array to split, e.g. y = [0, 1, 2, 3, 4]. train_test_split(y) gives us an output that will look something like [[0, 1, 2], [3, 4]] (plus or minus some shuffling). We see that it takes our original list and returns a list of 2 lists.
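The single-array case can be tried directly. This is a small sketch; random_state=0 is an added assumption to make the shuffle repeatable, and the exact elements in each part depend on it.

```python
from sklearn.model_selection import train_test_split

y = [0, 1, 2, 3, 4]

# One input sequence gives a list of two parts: [train, test].
result = train_test_split(y, random_state=0)
print(type(result).__name__, len(result))  # list 2

train, test = result
# The default test_size of 0.25 is rounded up to 2 of the 5 samples.
print(len(train), len(test))  # 3 2

# Every original element lands in exactly one of the two parts.
recombined = sorted(list(train) + list(test))
print(recombined)  # [0, 1, 2, 3, 4]
```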

We can pass train_test_split(...) an arbitrary number of lists (of equal length) to split. So let's see what happens if we give train_test_split(...) 2 lists (list_1, list_2) as input. It returns a list of 4 lists. The first two inner lists are the training set of list_1 followed by the testing set of list_1, and the last two inner lists are the training set of list_2 followed by the testing set of list_2. The returned lists, however, do not correspond to any keywords such as "X_train" or "X_test"; they're just good old regular lists.
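To make the two-input case concrete, here is a sketch with two made-up lists in which each element of list_2 is ten times its partner in list_1, so the pairing is easy to check. The list contents and random_state=0 are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: list_2[i] is always 10 * list_1[i].
list_1 = list(range(8))
list_2 = [10 * x for x in list_1]

splits = train_test_split(list_1, list_2, random_state=0)

# Two inputs give four outputs, ordered as:
# list_1 train, list_1 test, list_2 train, list_2 test.
lengths = [len(part) for part in splits]
print(lengths)  # [6, 2, 6, 2]

# Both inputs are shuffled with the same indices, so partners stay together.
train_paired = list(splits[2]) == [10 * x for x in splits[0]]
test_paired = list(splits[3]) == [10 * x for x in splits[1]]
print(train_paired, test_paired)  # True True
```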

One way to handle the output would be like this:

datasets = train_test_split(list_1, list_2)
list_1_train = datasets[0]
list_1_test = datasets[1]
list_2_train = datasets[2]
list_2_test = datasets[3]

However, this is lengthy, repetitive, and prone to bugs. Thankfully, Python gives us the syntax to unpack multiple values and assign them in a single statement. The equivalent of assigning the four lists as shown in the above code snippet would be:

[list_1_train, list_1_test, list_2_train, list_2_test] = train_test_split(list_1, list_2)

or, with more syntactic sugar:

list_1_train, list_1_test, list_2_train, list_2_test = train_test_split(list_1, list_2)
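The sketch below suggests that the indexing form and the unpacking form give the same four lists, and that a name on the left receives whatever sits at its position regardless of what the name says. The list contents, the fixed random_state=42, and the name list_2_test_wrong are all made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: numbers in list_1, letters in list_2.
list_1 = list(range(8))
list_2 = [chr(ord("a") + i) for i in list_1]

# Indexing form: keep the returned list and pick parts by position.
datasets = train_test_split(list_1, list_2, random_state=42)

# Unpacking form: same call, same seed, names bound left to right.
list_1_train, list_1_test, list_2_train, list_2_test = train_test_split(
    list_1, list_2, random_state=42
)
same = [list_1_train, list_1_test, list_2_train, list_2_test] == datasets
print(same)  # True

# A misleading name in first position still receives the first output,
# i.e. the training part of list_1 (numbers, not letters).
list_2_test_wrong, *_ = train_test_split(list_1, list_2, random_state=42)
mislabelled = list_2_test_wrong == list_1_train
print(mislabelled)  # True
```

This is the situation in the question: putting y_test first on the left-hand side binds it to the first output, which is the training part of X.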
