
What is the difference between 'transform' and 'fit_transform' in sklearn?

In the scikit-learn Python toolbox, sklearn.decomposition.RandomizedPCA has two methods, transform and fit_transform. Their descriptions are as follows:

[Images: documentation entries for transform and fit_transform]

But what is the difference between them?

In the scikit-learn estimator API:

fit(): learns the model parameters from the training data.

transform(): applies the parameters learned by fit() to a data set and returns the transformed data.

fit_transform(): combines fit() and transform() on the same data set.
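To make the three definitions concrete, here is a minimal sketch of a toy transformer. The class name MeanCenterer and its mean_ attribute are made up for illustration and are not part of scikit-learn; TransformerMixin is the scikit-learn base class that supplies a default fit_transform, which calls fit and then transform.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # fit(): learn the parameters (here, the per-column mean) from the data
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows fit(X).transform(X)

    def transform(self, X):
        # transform(): apply the stored parameters to any data set
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0], [4.0, 40.0]])

centerer = MeanCenterer()
# fit_transform(): fit and transform the same data in one call
train_centered = centerer.fit_transform(X_train)
# transform(): reuse the means of X_train on other data, without refitting
new_centered = centerer.transform(X_new)

print(centerer.mean_)
print(train_centered)
print(new_centered)
```

Note that X_new is centered with the means of X_train, not with its own means, which is the whole point of keeping fit and transform as separate steps.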


Check out Chapter 4 of this book and this answer on StackExchange for more clarity.

These methods are used to center and scale the features of a given data set. This basically puts all features on a comparable scale.

For this, we use the Z-score method:

z = (x - μ) / σ

We compute it on the training set.

1. fit(): calculates the parameters μ and σ and saves them on the object.

2. transform(): applies the transformation to a given data set using these calculated parameters.

3. fit_transform(): joins fit() and transform() to fit and transform a data set in one call.

Code snippet for feature scaling/standardisation (after train_test_split):

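from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Learn μ and σ from the training set and scale it
X_train = sc.fit_transform(X_train)
# Reuse the training-set μ and σ to scale the test set
X_test = sc.transform(X_test)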

We apply the same transformation to the test set, using the same two parameter values (μ and σ) that were computed from the training set.
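As a small check of this statement, the sketch below recomputes the test-set scaling by hand from the fitted scaler. The toy arrays are made up for illustration; mean_ and scale_ are the attributes where StandardScaler stores μ and σ after fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this example
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns μ and σ from X_train
X_test_scaled = sc.transform(X_test)        # reuses them, learns nothing new

# Apply z = (x - μ) / σ manually with the training-set parameters
mu, sigma = sc.mean_, sc.scale_
manual_test = (X_test - mu) / sigma

print(mu, sigma)
print(np.allclose(manual_test, X_test_scaled))
```

The scaled training set has mean 0, while the scaled test set generally does not, because it was never used to compute μ.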

The .transform method is meant for when you have already computed the PCA, i.e. when you have already called its .fit method.

In [12]: pc2 = RandomizedPCA(n_components=3)

In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-e3b6b8ea2aff> in <module>()
----> 1 pc2.transform(X)

/usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
    714         # XXX remove scipy.sparse support here in 0.16
    715         X = atleast2d_or_csr(X)
--> 716         if self.mean_ is not None:
    717             X = X - self.mean_
    718 

AttributeError: 'RandomizedPCA' object has no attribute 'mean_'

In [14]: pc2.ftransform(X) 
pc2.fit            pc2.fit_transform  

In [14]: pc2.fit_transform(X)
Out[14]: 
array([[-1.38340578, -0.2935787 ],
       [-2.22189802,  0.25133484],
       [-3.6053038 , -0.04224385],
       [ 1.38340578,  0.2935787 ],
       [ 2.22189802, -0.25133484],
       [ 3.6053038 ,  0.04224385]])
    
  

So you want to fit RandomizedPCA and then transform:

In [20]: pca = RandomizedPCA(n_components=3)

In [21]: pca.fit(X)
Out[21]: 
RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None,
       whiten=False)

In [22]: pca.transform(z)
Out[22]: 
array([[ 2.76681156,  0.58715739],
       [ 1.92831932,  1.13207093],
       [ 0.54491354,  0.83849224],
       [ 5.53362311,  1.17431479],
       [ 6.37211535,  0.62940125],
       [ 7.75552113,  0.92297994]])

In [23]: 

In particular, PCA's .transform applies the change of basis obtained through the PCA decomposition of the matrix X to the matrix Z.
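RandomizedPCA was removed from later scikit-learn releases in favour of PCA(svd_solver="randomized"), so the session above may not run on a current install. The following is a rough modern equivalent, assuming a recent scikit-learn; the random matrices X and Z are invented for illustration, and the error raised before fitting is now NotFittedError rather than the bare AttributeError shown above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError

# Made-up data: X is used for fitting, Z is only transformed
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
Z = rng.normal(size=(5, 3))

pca = PCA(n_components=2, svd_solver="randomized", random_state=0)

# transform() before fit() fails: no basis has been learned yet
try:
    pca.transform(X)
    failed_before_fit = False
except NotFittedError:
    failed_before_fit = True

# fit() learns the mean and the principal axes from X only
pca.fit(X)

# transform() projects Z onto the axes learned from X
Z_projected = pca.transform(Z)

# The same change of basis written out by hand (valid for whiten=False)
manual = (Z - pca.mean_) @ pca.components_.T

print(failed_before_fit)
print(Z_projected.shape)
print(np.allclose(Z_projected, manual))
```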

Why and when to use each of fit(), transform() and fit_transform()

Usually we have a supervised learning problem with (X, y) as our dataset, and we split it into training data and test data:

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

X_train_vectorized = model.fit_transform(X_train)
X_test_vectorized = model.transform(X_test)

Imagine we are fitting a tokenizer. If we fit it on X, we include the test data in the tokenizer. I have seen this mistake many times!

The correct approach is to fit ONLY on X_train: you don't know your "future data", so you cannot use X_test to fit anything.

Then you can transform your test data, but separately. That is why there are different methods.

Final tip: X_train_transformed = model.fit_transform(X_train) is equivalent to X_train_transformed = model.fit(X_train).transform(X_train), but the first one is often faster.

Note that what I call "model" will usually be a scaler, a tf-idf transformer, another kind of vectorizer, a tokenizer...

Remember: X represents the features and y represents the label of each sample. X is (usually) a DataFrame and y is a pandas Series.
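Here is a minimal sketch of the leak described above, using CountVectorizer as the "model". The four documents and their labels are invented for illustration, and shuffle=False is only there to keep the split deterministic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up corpus and labels
docs = ["the cat sat", "the dog barked", "a bird sang", "the fish swam"]
labels = [0, 1, 0, 1]

# Deterministic split: the last document becomes the test set
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=1, shuffle=False
)

# Correct: the vocabulary is learned from the training documents only
vec = CountVectorizer()
X_train_vectorized = vec.fit_transform(X_train)
X_test_vectorized = vec.transform(X_test)

# Leaky: fitting on all documents lets test-only words into the vocabulary
leaky = CountVectorizer().fit(docs)

print(sorted(vec.vocabulary_))
print(sorted(leaky.vocabulary_))
print(X_test_vectorized.toarray())
```

With the correct approach, the words "fish" and "swam" are simply ignored at transform time because they were never seen during fitting, which mirrors what happens with genuinely new data.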

Generic difference between the methods:

  • fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
  • fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
  • transform(raw_documents): Transform documents to a document-term matrix. Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Both fit_transform and transform return the same kind of object, a document-term matrix.

Source
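The signatures above match those of a text vectorizer such as CountVectorizer, so the sketch below assumes that class; the documents are made up. It shows the fitted vocabulary, the equivalence of fit_transform with fit followed by transform, and the case where the vocabulary is passed to the constructor so that transform works without a prior fit.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented example documents
train_docs = ["red apple", "green apple", "red cherry"]
new_docs = ["red red banana"]

# fit(): learn the vocabulary only
vec = CountVectorizer()
vec.fit(train_docs)
vocabulary = dict(vec.vocabulary_)

# fit_transform() gives the same matrix as fit() followed by transform()
one_step = CountVectorizer().fit_transform(train_docs).toarray()
two_steps = vec.transform(train_docs).toarray()

# transform() on new text counts only the words learned during fit()
new_counts = vec.transform(new_docs).toarray()

# A vocabulary given to the constructor is used as-is, so no fit() is needed
fixed = CountVectorizer(vocabulary=["banana", "red"])
fixed_counts = fixed.transform(new_docs).toarray()

print(vocabulary)
print(np.array_equal(one_step, two_steps))
print(new_counts)
print(fixed_counts)
```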

In layman's terms, fit_transform means doing some calculation and then doing the transformation (say, calculating the column means from some data and then replacing the missing values with them). So for the training set, you need to both calculate and transform.

But for the test set, machine learning applies what was learned from the training set, so nothing needs to be calculated; only the transformation is performed.

Here is the basic difference between .fit() and .fit_transform():

.fit() only learns from the data. In supervised learning it takes two arguments (X, y) to fit the model, because we know what we are going to predict.

.fit_transform() learns from the data and returns the transformed data in one call. It is what transformers provide, and in unsupervised settings it is typically called with one argument (X), because there is no target to predict.

When we have two arrays with different elements, we use fit and transform separately. We fit on array 1 to learn its internal statistics (for example, MinMaxScaler learns the minimum and maximum of each feature, and StandardScaler learns the mean and standard deviation). For example, if we fit on array 1 to learn its mean and then transform array 2, the mean of array 1 is applied to array 2. In simple words, we transform one array using the statistics of another array.
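The same idea can be sketched with MinMaxScaler, which stores the per-feature minimum and maximum it sees during fit in data_min_ and data_max_. The two small arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature arrays
array_1 = np.array([[0.0], [5.0], [10.0]])
array_2 = np.array([[5.0], [20.0]])

# Fit on array 1: learns min = 0 and max = 10
scaler = MinMaxScaler()
scaler.fit(array_1)

# Transform array 2 with the statistics of array 1: (x - 0) / (10 - 0)
scaled_with_array_1 = scaler.transform(array_2)

# For comparison, array 2 scaled with its own min and max: (x - 5) / (20 - 5)
scaled_with_itself = MinMaxScaler().fit_transform(array_2)

print(scaler.data_min_, scaler.data_max_)
print(scaled_with_array_1.ravel())
print(scaled_with_itself.ravel())
```

Values of array 2 that lie outside the range seen in array 1 end up outside [0, 1], which is expected when the statistics come from a different array.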

Code demonstration:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

temperature = [32., np.nan, 28., np.nan, 32., np.nan, np.nan, 34., 40.]
windspeed = [6., 9., np.nan, 7., np.nan, np.nan, np.nan, 8., 12.]
n_arr_1 = np.array(temperature).reshape(3, 3)
print('temperature:\n', n_arr_1)
n_arr_2 = np.array(windspeed).reshape(3, 3)
print('windspeed:\n', n_arr_2)

Output:

temperature:
 [[32. nan 28.]
 [nan 32. nan]
 [nan 34. 40.]]
windspeed:
 [[ 6.  9. nan]
 [ 7. nan nan]
 [nan  8. 12.]]

Fit and transform separately, transforming array 2 with the imputer fitted on array 1 (using the column means of array 1):

imp.fit(n_arr_1)
imp.transform(n_arr_2)

Output

Compare the output below with the two previous outputs and you will see the difference. Basically, the imputer takes the mean of every column of array 1 and fills each missing value in array 2 with the mean of the corresponding column.

array([[ 6.,  9., 34.],
       [ 7., 33., 34.],
       [32.,  8., 12.]])

This is what we do when we want to transform one array based on another array. But when we have a single array and want to transform it based on its own mean, we use fit_transform.

See below:

imp.fit_transform(n_arr_2)

Output

array([[ 6. ,  9. , 12. ],
       [ 7. ,  8.5, 12. ],
       [ 6.5,  8. , 12. ]])

Alternatively, we could do:

imp.fit(n_arr_2)
imp.transform(n_arr_2)

Output

array([[ 6. ,  9. , 12. ],
       [ 7. ,  8.5, 12. ],
       [ 6.5,  8. , 12. ]])

Why fit and transform the same array separately in two lines of code when fit_transform can fit and transform it in one line? That is the difference between fit, transform and fit_transform.

The answer below applies to any kind of sklearn-related library. Before looking at fit_transform, let's see what the fit method is:

fit(X) - Fit the model with X by extracting the first principal components.

fit_transform(X) - Fit the model with X and apply the dimensionality reduction on X.

fit_transform ---> fit(x).transform(x)

transform(x) - Apply dimensionality reduction on X.

You can see the sklearn randomized PCA doc here for further details.

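Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce them, please credit this site or the original URL. For any questions, contact: yoyou2525@163.com.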

 