
What is the difference between fit_transform and transform in sklearn CountVectorizer?

Original post 2016-08-01 06:46:27 4 3 python/ scikit-learn/ tokenize/ text-processing

I was recently practicing with the Kaggle bag of words introduction, and I want to clear up a few things:

Using vectorizer.fit_transform( "*on the list of cleaned reviews*" )

When we were preparing the bag of words array on the train reviews, we used fit_transform on the list of train reviews. I know that fit_transform does two things: first it fits on the data and learns the vocabulary, and then it makes a vector for each review.
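The two steps described above can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the three short reviews are made up for illustration and are not from the Kaggle data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned train reviews, for illustration only.
train_reviews = ["the movie was great", "the movie was terrible", "great acting"]

# One step: learn the vocabulary and build the count matrix.
one_step = CountVectorizer().fit_transform(train_reviews)

# Two steps: fit() learns the vocabulary, transform() builds the counts.
vectorizer = CountVectorizer()
vectorizer.fit(train_reviews)
two_step = vectorizer.transform(train_reviews)

# Both routes produce the same document-term matrix.
same_result = bool((one_step.toarray() == two_step.toarray()).all())
vocabulary = sorted(vectorizer.vocabulary_)

print(vocabulary)   # ['acting', 'great', 'movie', 'terrible', 'the', 'was']
print(same_result)  # True
```

So on the training data, fit_transform is just a shortcut for fit followed by transform.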

Thus, when we used vectorizer.transform( "*list of cleaned test reviews*" ), this just transformed the list of test reviews into a vector for each review.

My question is: why not use fit_transform on the test list too? The documentation says it leads to overfitting, but it does make sense to me to use it anyway. Let me give you my perspective:

When we don't use fit_transform, we are essentially saying: make the feature vectors of the test reviews using the most frequent words of the train reviews. Why not build the test feature array using the most frequent words in the test set itself?

I mean, does the random forest care? If we give the random forest the train feature array and the train sentiment labels to train itself with, and then give it the test feature array, won't it just give its prediction on sentiment?

3 Answers

You do not do a fit_transform on the test data because, when you fit a Random Forest, the Random Forest learns its classification rules based on the values of the features that you provide. If these rules are to be applied to classify the test set, then you need to make sure that the test features are calculated in the same way, using the same vocabulary. If the vocabulary of the training features and the test features is different, then the features will not really make sense, as they will reflect a vocabulary that is separate from the one the model was trained on.

Now, speaking specifically about CountVectorizer, consider the following example. Let your training data have the following 3 sentences:

Dog is black.
Sky is blue.
Dog is dancing.

The vocabulary set for this will be {Dog, is, black, sky, blue, dancing}. The Random Forest that you train will try to learn rules based on the counts of these 6 vocabulary terms, so your features will be vectors of length 6. Now suppose the test set is as follows:

Dog is white.
Sky is black.

If you use the test data for fit_transform, your vocabulary will look like {Dog, white, is, Sky, black}. Each document will then be represented by a vector of length 5 denoting the counts of each of these terms. This is like comparing apples with oranges: you learned rules for the counts of the previous vocabulary, and those rules cannot be applied to this vocabulary. This is the reason why you only fit on the training data.
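The example above can be reproduced as a small sketch, assuming scikit-learn is installed and CountVectorizer is used with its default settings. Note that with those defaults the terms are lowercased and the columns are ordered alphabetically, so the vocabularies print slightly differently from the sets written above.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["Dog is black.", "Sky is blue.", "Dog is dancing."]
test_docs = ["Dog is white.", "Sky is black."]

# Fit on the training data only: this fixes the 6-term vocabulary.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
train_vocab = sorted(vectorizer.vocabulary_)

# Correct: reuse the fitted vocabulary, so test vectors also have 6 columns.
# The unseen word "white" has no column and is simply ignored.
X_test = vectorizer.transform(test_docs)

# Incorrect: refitting on the test data builds a different 5-term vocabulary.
refit = CountVectorizer()
X_test_refit = refit.fit_transform(test_docs)
test_vocab = sorted(refit.vocabulary_)

print(train_vocab)         # ['black', 'blue', 'dancing', 'dog', 'is', 'sky']
print(test_vocab)          # ['black', 'dog', 'is', 'sky', 'white']
print(X_test.shape)        # (2, 6)
print(X_test_refit.shape)  # (2, 5)
```

In the refitted matrix, column 1 counts "dog", whereas in the training matrix column 1 counts "blue", so a rule learned on the training columns would be applied to the wrong word.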

Basically, you split the whole data into train and test so that only the train data is exposed to the model and to the calculation of other statistics such as means and standard deviations. If you expose the test data, your model may no longer generalize and there is a chance of overfitting. So expose only the train data with fit_transform, and apply the learned statistics to the test data with transform.
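The same rule for means and standard deviations can be sketched with StandardScaler. This is an illustration only, assuming scikit-learn and NumPy are installed; the scaler and the toy numbers are not part of the original question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, made up for illustration.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [6.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train mean and std

train_mean = scaler.mean_[0]   # mean of the training column
train_std = scaler.scale_[0]   # population standard deviation of the training column

# The test data is scaled with the training statistics, not its own.
expected_test = (X_test - train_mean) / train_std
print(train_mean)
print(np.allclose(X_test_scaled, expected_test))
```

The test values are never used to compute the mean or standard deviation, so nothing about the test set leaks into the preprocessing.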

In short, fit is used to train the model; once it is trained, you can use that model. To use it, of course, you call transform. (Remember that fit generally computes the quantities, such as a vocabulary or normalization statistics, that transform later applies.)

So you can use fit and transform on the test data, but it is not a wise decision: you duplicate the effort (your model is already fitted on the train data), and in the long term it may lower performance too.
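One way to avoid refitting on test data by accident is to chain the vectorizer and the classifier in a scikit-learn pipeline. The following is a minimal sketch, assuming scikit-learn is installed; the reviews, labels, and the choice of 10 trees are hypothetical and only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned reviews and sentiment labels (1 = positive, 0 = negative).
train_reviews = ["great movie", "loved it", "terrible movie", "hated it"]
train_labels = [1, 1, 0, 0]
test_reviews = ["great acting", "terrible plot"]

model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

# fit() calls fit_transform on the vectorizer, then fits the forest.
model.fit(train_reviews, train_labels)

# predict() only calls transform on the vectorizer, so the test features
# use the vocabulary learned from the train reviews.
predictions = model.predict(test_reviews)
n_features = model.named_steps["countvectorizer"].transform(test_reviews).shape[1]

print(len(predictions))
print(n_features)
```

With this layout the test reviews always pass through transform only, so the forest sees the same six columns at prediction time as it did during training.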