model.fit vs model.predict - differences & usage in sklearn
I am new to ML with Python and am working through my first tutorial. In that tutorial, there are some lines of code whose interaction with each other I have difficulty understanding.
First, the data is split as follows:
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)
My first question: Why do we use validation data rather than test data? Why not all three: train, val, and test? What is the use case for each combination?
The next section specifies the ML model and makes predictions.
model = DecisionTreeRegressor()
model.fit(train_x, train_y)
val_predictions = model.predict(val_x)
My second question: For the model.predict() statement, why do we put val_x in there? Don't we want to predict val_y?
Bonus question: In many tutorials I have seen StandardScaler being applied. In this tutorial it does not appear. Did some other function already scale the data without it being stated explicitly? Please help.
1) Validation sets are often used to help you tune hyper-parameters. Because you may fine-tune the model according to its performance on the validation set, the model may become slightly biased toward the validation data, even though it isn't directly trained on that data. That is why we keep the validation set separate from the test set. Once you have tuned the model to your liking based on the validation set, you can evaluate it on your test set to see how well it generalizes.
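As a rough sketch of that workflow, the example below splits synthetic data (invented here purely for illustration, not taken from the tutorial) into train, validation, and test sets by calling train_test_split twice. The candidate max_depth values, the split ratios, and the variable names other than train_x, val_x, train_y, val_y are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, made up for this illustration.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Hold out a test set first, then split the remainder into train and validation.
X_rest, test_x, y_rest, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
train_x, val_x, train_y, val_y = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Tune one hyper-parameter using the validation set only.
best_depth, best_score = None, -np.inf
for depth in (2, 4, 8):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_x, train_y)
    score = model.score(val_x, val_y)
    if score > best_score:
        best_depth, best_score = depth, score

# Evaluate the chosen setting once on the untouched test set.
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=0)
final_model.fit(train_x, train_y)
test_score = final_model.score(test_x, test_y)
print(best_depth, test_score)
```

The test set plays no part in choosing max_depth, so test_score is a less optimistic estimate of how the model generalizes than best_score.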
2) Calling model.predict(val_x) will return the predicted y values based on the given x values. You can then use some loss function to compare those predicted values with val_y to evaluate the model's performance on your validation set.
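For instance, a minimal sketch using mean absolute error as the loss function might look like the following. The data is synthetic and the choice of mean_absolute_error is an assumption; any regression metric from sklearn.metrics could be used instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for this illustration.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_x, train_y)

# predict() takes the features and returns one predicted y per row of val_x.
val_predictions = model.predict(val_x)

# Compare the predictions with the true targets the model never saw.
mae = mean_absolute_error(val_y, val_predictions)
print(val_predictions.shape, mae)
```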
Q 1.1: Why do we use validation data over test data? (in the above scenario)
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)
First of all, the terms validation set and test set are used very loosely in many tutorials, and sometimes interchangeably. The val_x, val_y above could just as well be called test_x, test_y.
Q 1.2: Why not all, train, val, and test? (why the split?)
All our machine learning algorithms are eventually going to be used on some real-world data (the actual test data). However, after devising an algorithm we want to "test" how well it performs, what its accuracy is, etc.
But we don't currently have that real-world data, right?
So what do we have? The train data! We cleverly set aside a portion of it (the split) for testing the algorithm later. The test data is used to evaluate performance once the model is ready.
model = DecisionTreeRegressor()
model.fit(train_x, train_y)
val_predictions = model.predict(val_x) # contains y values predicted by the model
score = model.score(val_x, val_y) # evaluates predicted y against actual y of test data
print(score)
Q 2: For the model.predict() statement, why do we put val_x in there? Don't we want to predict val_y?
Absolutely right, we want to predict val_y, but the model needs val_x to predict y. That's exactly what we are passing as the argument to the predict function.
I understand it might be confusing to read model predict val_x. A better way to interpret it is: model, could you please predict from val_x and return predicted_y.
I say predicted_y and not val_y because the two won't be exactly the same. How much do they differ? That is what score tells you (for a regressor such as DecisionTreeRegressor, score returns the R² coefficient of determination).
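To make the link between predict and score concrete, here is a small sketch on synthetic data (made up for illustration). It checks that, for a regressor, model.score(val_x, val_y) matches r2_score computed from the predictions by hand.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for this illustration.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_x, train_y)

# score() predicts from val_x internally and compares the result with val_y.
score = model.score(val_x, val_y)

# The same comparison done by hand: predict first, then compute R^2.
predicted_y = model.predict(val_x)
manual_score = r2_score(val_y, predicted_y)

print(np.isclose(score, manual_score))
```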
Some Terminologies
Q 1.3: What is the use case for which combination to use?
You will always have train and test, or all three. In your case, however, the test set is simply named val.
Bonus question: In many tutorials I saw how StandardScalers are applied. However, in this tutorial it doesn't appear as such, or did some other function already scale it without having to explicitly state it?
It all depends on your data. If the data is pre-processed and already scaled properly, then StandardScaler need not be applied. This particular tutorial simply implies that the data is already normalised. Note also that decision trees split on thresholds of individual features, so a DecisionTreeRegressor is largely insensitive to feature scaling in any case.
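When the data is not already scaled and the model is sensitive to feature scale, one common pattern is to put StandardScaler in a pipeline so that it is fitted on the training data only. The sketch below assumes synthetic data and swaps in KNeighborsRegressor as an example of a scale-sensitive model; neither comes from the tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales (illustration only).
rng = np.random.RandomState(0)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3, random_state=0)

# fit() learns the scaling statistics from train_x only, then fits the regressor.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
pipeline.fit(train_x, train_y)

# predict() applies the same training-set scaling to val_x before predicting.
val_predictions = pipeline.predict(val_x)

# The scaler's stored means come from the training split, not the full data.
scaler = pipeline.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, train_x.mean(axis=0)))
```

Keeping the scaler inside the pipeline avoids leaking validation statistics into training, which would happen if the scaler were fitted on X before the split.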