Why should we use the fit_transform method when we get the same output using transform?

I don't understand why one has to use the fit_transform method when the transform method appears to give the same output. What is the whole point of the fit method?

I have printed x_train and x_test, and both of them gave similar output.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])

What will happen if you do not call sc.fit_transform() before sc.transform()? The latter will fail with the message:

NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
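A minimal sketch that reproduces this failure, assuming scikit-learn and NumPy are installed. The array X is made-up illustrative data, not the asker's x_train:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

# Illustrative data (an assumption, not taken from the question)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

sc = StandardScaler()
failed = False
try:
    sc.transform(X)  # transform() before any fit(): nothing has been learned yet
except NotFittedError as exc:
    failed = True
    print(type(exc).__name__)
```

The scaler has no mean or scale to apply until fit() or fit_transform() has seen some data.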

The function fit_transform() does what fit() followed by transform() would do.

You would use fit() alone if you were not interested in the transformed values of the training set.
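As a quick check of that equivalence, the sketch below compares the one-step and two-step routes on a small made-up array (the data and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training data (an assumption for this sketch)
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# One step: learn the parameters and apply them immediately
one_step = StandardScaler().fit_transform(X_train)

# Two steps: learn the parameters, then apply them
two_step_scaler = StandardScaler()
two_step_scaler.fit(X_train)
two_steps = two_step_scaler.transform(X_train)

same = np.allclose(one_step, two_steps)
print(same)
```

Both routes should produce the same array; fit_transform() is mainly a convenience for the training set.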

So in scikit-learn preprocessors you usually have a fit, a transform and a fit_transform method.

The differences are as follows:

fit learns the structure of your data: for example, the categories that exist in it, or statistics such as the mean, depending on the preprocessor. Once you have fitted your preprocessor, you can use that fitted preprocessor to transform your data using the information it learned. Let's take a simple example:

import numpy as np 
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 2], [3, 4], [5, 6]])
X_test = np.array([[7, 8], [9, 10]])

X_train:
array([[1, 2],
       [3, 4],
       [5, 6]])

X_test:
array([[ 7,  8],
       [ 9, 10]])

Here you are preparing a standard scaler object:

sc = StandardScaler()

This object must have some parameters holding information such as the mean of the data. But since it hasn't seen any data yet, this mean value doesn't exist yet, so the following code will raise an error:

print(sc.mean_)

AttributeError: 'StandardScaler' object has no attribute 'mean_'

Now let's use it to fit the X_train data:

sc.fit(X_train)

Let's see what happened after this operation:

print(sc.mean_)

[3. 4.]

Now we can see that our standard scaler object has computed the mean of the data it has seen and stored it in one of its attributes, here mean_.
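The mean is not the only thing a fitted StandardScaler stores. A small sketch, reusing the same X_train values as above, that prints a few of the learned attributes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 2], [3, 4], [5, 6]])

sc = StandardScaler().fit(X_train)  # fit() returns the fitted scaler itself

print(sc.mean_)            # per-column mean of the training data
print(sc.var_)             # per-column variance of the training data
print(sc.scale_)           # per-column standard deviation, used for scaling
print(sc.n_samples_seen_)  # number of rows seen during fit
```

These trailing-underscore attributes only exist after fitting, which is why accessing mean_ on a fresh scaler raised an AttributeError earlier.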

So this is basically the role of the fit method: it finds parameters from some data, in our case the training data. We want to find those parameters first because we might want to reuse exactly the same ones to transform other data. That's where the transform method comes in.

The transform method uses the 'learned' parameters of some previous data to transform some new data. So in our case we can now transform our test data. This is because the train and test data should be transformed the same way (with the same parameters, such as the mean).

sc.transform(X_test)

array([[2.44949 , 2.44949 ],
       [3.674235, 3.674235]])
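To see where these numbers come from, the sketch below recomputes them by hand from the parameters learned on X_train. It assumes the default StandardScaler settings (centering and scaling both enabled):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 2], [3, 4], [5, 6]])
X_test = np.array([[7, 8], [9, 10]])

sc = StandardScaler().fit(X_train)

# Standardize the test data with the *training* mean and standard deviation
manual = (X_test - sc.mean_) / sc.scale_

print(manual)
print(np.allclose(manual, sc.transform(X_test)))
```

The test values are far from zero because they are measured against the training mean of [3, 4], not against the test set's own mean.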

But of course we should also transform the training data itself first!

sc.transform(X_train)

array([[-1.224745, -1.224745],
       [ 0.      ,  0.      ],
       [ 1.224745,  1.224745]])

As you can see, we fitted and then transformed our training data in two consecutive steps, while we only transformed our test data, without fitting on it. Fitting and transforming in one go is what the fit_transform method is for. So for the training data we can directly do:

X_train = sc.fit_transform(X_train)

array([[-1.224745, -1.224745],
       [ 0.      ,  0.      ],
       [ 1.224745,  1.224745]])

This method fits the data and then transforms it. But you can't transform data without having fitted first. Once you have fitted on your training data using fit_transform or just fit, you can transform your test data with the same fitted parameters as the training data.
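The output of transform and fit_transform can look similar when printed, but the two calls are not interchangeable on the test set. A sketch of the difference, using the same toy arrays as above (the variable names reused and refit are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 2], [3, 4], [5, 6]])
X_test = np.array([[7, 8], [9, 10]])

# Intended usage: learn parameters on the training data, reuse them on the test data
sc = StandardScaler().fit(X_train)
reused = sc.transform(X_test)

# For contrast: re-fitting on the test data learns different parameters
refit = StandardScaler().fit_transform(X_test)

print(reused)
print(refit)
print(np.allclose(reused, refit))
```

Re-fitting on the test set standardizes it against its own mean and spread, so the two sets are no longer on the same scale and a model trained on the scaled training data would see inconsistent inputs.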

Hope this was clear enough.

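Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish, please cite this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.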

 