
Using fit_transform() and transform()

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

What I know is that the fit() method calculates the mean and standard deviation of a feature, and the transform() method then uses them to turn the feature into a new, scaled feature. fit_transform() is nothing but calling fit() and transform() in a single line.

But why do we call fit() only on the training data and not on the testing data?

Does that mean we are using the mean and standard deviation of the training data to transform our testing data?

fit computes the mean and stdev to be used for later scaling. Note that it is only a computation; no scaling is done.
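To see what fit stores, here is a minimal sketch on a made-up 4x2 array (the data is an assumption for illustration, not from the question). After fit, the per-feature statistics are available as the mean_ and scale_ attributes, and the input array is untouched.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 samples, 2 features.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])

sc = StandardScaler()
sc.fit(X_train)  # only computes statistics; X_train itself is not modified

print(sc.mean_)   # per-feature mean
print(sc.scale_)  # per-feature standard deviation (population form, ddof=0)
```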

transform uses the previously computed mean and stdev to scale the data (it subtracts the mean from every value and then divides the result by the stdev).
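The arithmetic can be checked by hand. This sketch, again on made-up data, compares transform against the explicit (x - mean) / stdev computation using the statistics stored by fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, made up for illustration.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

sc = StandardScaler().fit(X_train)

scaled = sc.transform(X_train)                # scaling done by the library
manual = (X_train - sc.mean_) / sc.scale_     # same formula written out by hand

print(np.allclose(scaled, manual))  # True
```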

fit_transform does both in one call, so you can do it with just one line of code.
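As a quick check of that equivalence, the sketch below (toy data assumed) runs the one-step and two-step forms on the same array and compares the results.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

one_step = StandardScaler().fit_transform(X)      # fit and transform together
two_step = StandardScaler().fit(X).transform(X)   # fit first, then transform

print(np.allclose(one_step, two_step))  # True
```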

For the X_train dataset, we call fit_transform because we need to compute the mean and stdev and then use them to scale X_train. For the X_test dataset, since we already have the mean and stdev, we only do the transformation part.
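A small sketch of that workflow, with made-up arrays standing in for X_train and X_test: the scaler's stored mean does not change when transform is called on the test data, and the test values are scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real splits, made up for illustration.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [5.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean/stdev from X_train
mean_before = sc.mean_.copy()

X_test_scaled = sc.transform(X_test)        # reuses the training statistics

print(np.array_equal(sc.mean_, mean_before))  # True: transform does not refit
print(X_test_scaled.ravel())                  # test values on the training scale
```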

Edit: X_test data should be totally unseen and unknown (i.e., no info is extracted from it), so we can only derive info from X_train. The reason we also apply the mean and stdev derived from X_train to X_test is that the model was trained on features scaled that way; scaling the test features identically gives an apples-to-apples comparison between y_test and y_pred.
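One way to see why the scaling has to be shared: if a second scaler were fitted on the test data, the same raw value would land on a different scaled value than it did during training. The sketch below uses made-up arrays and a hypothetical second scaler named sc_wrong purely to show the mismatch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration. The raw value 3.0 appears in both sets.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[3.0], [5.0], [7.0]])

sc = StandardScaler().fit(X_train)        # correct: statistics from training data
sc_wrong = StandardScaler().fit(X_test)   # hypothetical mistake: refit on test data

same_raw_value = np.array([[3.0]])
as_in_training = sc.transform(same_raw_value)[0, 0]
with_refit = sc_wrong.transform(same_raw_value)[0, 0]

# The two results differ, so a model trained on the first scale
# would misread inputs produced by the second.
print(as_in_training, with_refit)
```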

By the way, if the train/test data is split properly without bias, and the data is sufficiently large, the mean and stdev of both datasets would closely approximate those of the population, and therefore each other.
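This can be illustrated with synthetic data. The sketch assumes a normally distributed feature with mean 5 and stdev 2 (arbitrary choices) and a random 80/20 split; with 100,000 samples, the statistics of the two splits come out very close.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature, made up for illustration: normal with mean 5, stdev 2.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100_000, 1))

# Random, unbiased 80/20 split.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

print(X_train.mean(), X_train.std())  # close to 5 and 2
print(X_test.mean(), X_test.std())    # also close to 5 and 2
```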


 