model.fit vs model.predict: differences and usage in sklearn

Question

I am new to ML with Python and am working through my first tutorial. It contains a few lines of code whose interaction I have difficulty understanding.

First, the data is split as follows:

train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)

My first question: Why do we use validation data rather than test data? Why not all three: train, val, and test? What is the use case for each combination?

The next section specifies the ML model and makes predictions.

model = DecisionTreeRegressor() 
model.fit(train_x, train_y)
val_predictions = model.predict(val_x)

My second question: In the model.predict() statement, why do we pass val_x? Don't we want to predict val_y?

Bonus question: In many tutorials I have seen StandardScaler applied. In this tutorial it doesn't appear. Did some other function already scale the data without it being stated explicitly? Please help.

Answer 1

1) Validation sets are typically used to tune hyper-parameters. Because you fine-tune the model according to its performance on the validation set, the model may become slightly biased toward the validation data even though it is never trained on it directly. That is why the validation set is kept separate from the test set. Once you have tuned the model to your liking on the validation set, you evaluate it on the test set to see how well it generalizes.
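The workflow described in point 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the data comes from make_regression as a stand-in for the tutorial's X and y, the split sizes are arbitrary, and max_depth is just one example of a hyper-parameter to tune.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the tutorial's X and y (assumption).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Carve off the test set first, then split the remainder into train and validation.
rest_x, test_x, rest_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
train_x, val_x, train_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.25, random_state=0
)

# Tune one hyper-parameter (max_depth) using the validation set only.
val_errors = {}
for depth in (2, 4, 8, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_x, train_y)
    val_errors[depth] = mean_absolute_error(val_y, model.predict(val_x))
best_depth = min(val_errors, key=val_errors.get)

# Evaluate the chosen setting once on the untouched test set.
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=0)
final_model.fit(train_x, train_y)
test_mae = mean_absolute_error(test_y, final_model.predict(test_x))
print(best_depth, test_mae)
```

The test set is touched exactly once, after all tuning decisions have been made, so its error is not influenced by the choices made on the validation set.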

2) Calling model.predict(val_x) returns the y values predicted from the given x values. You can then use a loss function to compare those predictions with val_y and so evaluate the model's performance on the validation set.
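A minimal sketch of point 2, assuming synthetic data from make_regression in place of the tutorial's X and y, and mean absolute error as the loss function (the answer does not name a specific one):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the tutorial's X and y (assumption).
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_x, train_y)             # fit sees features AND known targets
val_predictions = model.predict(val_x)  # predict sees features only

# Compare the predictions with the true targets using a loss function.
val_mae = mean_absolute_error(val_y, val_predictions)
print(val_predictions.shape, val_y.shape, val_mae)
```

Note that val_y never enters predict; it is only used afterwards to measure how far off the predictions are.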

Answer 2

Q 1.1: Why do we use validation data rather than test data? (in the above scenario)

train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)

First of all, the terms validation set and test set are used very loosely in many tutorials, sometimes interchangeably. The val_x, val_y above could just as well be called test_x, test_y.

Q 1.2: Why not all three: train, val, and test? (why the split?)

All our machine learning algorithms will eventually be used on real-world data (the actual test data). However, after devising an algorithm we want to "test" how well it performs, how accurate it is, and so on.

At that point we don't actually have the real-world data yet, right?

But what do we have? The training data! So we cleverly set aside a portion of it (the split) to test the algorithm later. That held-out data is used to evaluate performance once the model is ready.

model = DecisionTreeRegressor() 
model.fit(train_x, train_y)
val_predictions = model.predict(val_x)  # y values predicted by the model from val_x
score = model.score(val_x, val_y)  # R^2 of the predictions on val_x against the actual val_y
print(score)

Q 2: In the model.predict() statement, why do we pass val_x? Don't we want to predict val_y?

Absolutely right, we want to predict val_y, but the model needs val_x in order to predict y. That is exactly what we pass as the argument to the predict function.

I understand that reading it as "model, predict val_x" can be confusing.

A better way to read it is: "model, could you please predict from val_x and return predicted_y".

I say predicted_y and not val_y because the two won't be exactly the same. How much do they differ? That is what the score tells you.
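To make the relationship between predicted_y, val_y, and the score concrete, here is a small sketch. It assumes synthetic data from make_regression; predicted_y is the illustrative name used in this answer, not an sklearn identifier. For sklearn regressors, score returns the coefficient of determination (R^2), which is the same value r2_score computes from the true and predicted targets.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the tutorial's X and y (assumption).
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_x, train_y)

predicted_y = model.predict(val_x)            # what the model thinks y is
score = model.score(val_x, val_y)             # predicts internally, then compares
manual_score = r2_score(val_y, predicted_y)   # the same comparison done by hand
print(score, manual_score)
```

A score of 1.0 would mean predicted_y matches val_y exactly; lower values mean the two differ more.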

Some Terminology

Data Set: The data in hand. This is the data that later gets divided.
Train Set: The part of the Data Set from which our model learns. Usually the largest part, about 70-80%. Commonly denoted by train_x and train_y.
Test Set: The part of the Data Set that we set aside to evaluate the performance of the model. It "tests" the model, hence the name. Denoted by test_x and test_y.
Validation Set: If we want unbiased estimates of accuracy during the learning process, we use another split of the Data Set. It is usually used to find hyperparameters and the like, typically to:
- Pick the best-performing algorithm (NB vs DT vs ...)
- Fine-tune parameters (tree depth, k in kNN, C in SVM)
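As a rough sketch of the first use listed above (picking the best-performing algorithm on the validation set): because this thread is about a regression problem, three regressors are swapped in here for the NB vs DT example, and the data is synthetic. Both choices are assumptions made purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the tutorial's X and y (assumption).
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate algorithms, all trained on the same training set.
candidates = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "kNN": KNeighborsRegressor(n_neighbors=5),
    "linear regression": LinearRegression(),
}

# Score each candidate on the validation set and keep the best one.
val_scores = {}
for name, candidate in candidates.items():
    candidate.fit(train_x, train_y)
    val_scores[name] = candidate.score(val_x, val_y)
best_name = max(val_scores, key=val_scores.get)
print(val_scores, best_name)
```

If a separate test set is available, the winner would then be evaluated on it once to get an estimate that was not used for the selection.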

Q 1.3: What is the use case for each combination?

You will always have either train and test, or all three. In your case, however, the test set is simply named val.

Bonus question: In many tutorials I have seen StandardScaler applied. In this tutorial it doesn't appear. Did some other function already scale the data without it being stated explicitly?

It all depends on your data. If the data is already pre-processed and properly scaled, StandardScaler need not be applied. This particular tutorial simply assumes that the data is already normalised. Note also that tree-based models such as DecisionTreeRegressor split on per-feature thresholds, so they are largely insensitive to feature scaling.
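For cases where scaling does matter (distance-based models such as k-nearest neighbours are a common example), it is usually applied along these lines. This is a sketch under assumptions: the data is synthetic, one feature is deliberately inflated to create a scale mismatch, and KNeighborsRegressor is chosen only as an example of a scale-sensitive model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tutorial's X and y (assumption).
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X[:, 0] *= 1000.0  # exaggerate one feature's scale for illustration
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn mean and standard deviation from the training data only,
# then reuse those statistics on the validation data.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_x)
val_scaled = scaler.transform(val_x)

# A pipeline performs the same two steps automatically on fit and predict.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(train_x, train_y)
print(train_scaled.mean(axis=0).round(6), knn.score(val_x, val_y))
```

Fitting the scaler on the training split only keeps information from the validation data out of the preprocessing step.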

Answer 1
2 2019-06-09 23:18:11

Answer 2
1 Accepted 2019-06-09 23:24:27
