What is the difference between fit_transform and transform in sklearn CountVectorizer?
I was recently working through the Kaggle bag of words introduction, and I want to clear up a few things about using vectorizer.fit_transform( " * on the list of *cleaned* reviews* " ).
When we prepared the bag-of-words array for the train reviews, we used fit_transform on the list of train reviews. I know that fit_transform does two things: first it fits on the data and learns the vocabulary, and then it builds a vector for each review.
Then, when we used vectorizer.transform( "*list of cleaned test reviews* " ), this just transformed the list of test reviews into a vector for each review.
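As a minimal sketch of the two calls described above (the review strings here are hypothetical stand-ins, not the Kaggle data), fit_transform learns the vocabulary and counts in one step, while transform only counts against the vocabulary that was already learned:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the cleaned Kaggle reviews.
clean_train_reviews = ["great movie great cast", "boring plot bad acting"]
clean_test_reviews = ["great plot terrible acting"]

vectorizer = CountVectorizer()

# fit_transform: learn the vocabulary from the train reviews, then count.
train_features = vectorizer.fit_transform(clean_train_reviews)

# transform: count only, reusing the vocabulary learned above.
test_features = vectorizer.transform(clean_test_reviews)

# Sorted vocabulary terms give the column order of both matrices.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_features.toarray())
print(test_features.toarray())
```

Note that the word "terrible" appears only in the test review, so it has no column and is simply ignored by transform.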
My question is: why not use fit_transform on the test list too? The documentation says it leads to overfitting, but it still makes sense to me to use it anyway. Let me give you my perspective:
When we don't use fit_transform, we are essentially saying: build the feature vectors of the test reviews using the most frequent words of the train reviews. Why not build the test feature array using the most frequent words in the test set itself?
I mean, does the random forest care? If we give the random forest the train feature array and the train sentiment labels to train on, and then give it the test feature array, won't it just give its sentiment predictions?
You do not call fit_transform on the test data because, when you fit a Random Forest, it learns classification rules based on the values of the features you provide. If these rules are to be applied to classify the test set, then you need to make sure that the test features are calculated in the same way, using the same vocabulary. If the vocabularies of the training and test features differ, the features will not make sense, because they will reflect a vocabulary separate from the one the model was trained on.
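To illustrate why the forest does "care", here is a small sketch with made-up reviews and labels (all data, and the n_estimators and random_state settings, are assumptions chosen for illustration). A forest trained on the training vocabulary accepts test features produced by transform, but rejects features from a vectorizer refitted on the test set when the number of columns differs:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; label 1 = positive, 0 = negative.
train_reviews = ["good good film", "bad film", "good story", "bad bad story"]
train_labels = [1, 0, 1, 0]
test_reviews = ["good plot", "bad plot"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_reviews)  # columns: bad, film, good, story
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, train_labels)

# Correct: reuse the training vocabulary so each column keeps its meaning.
X_test = vectorizer.transform(test_reviews)
predictions = forest.predict(X_test)

# Incorrect: a vectorizer fitted on the test reviews learns a different
# vocabulary (bad, good, plot), so the feature matrix has another shape.
X_test_refit = CountVectorizer().fit_transform(test_reviews)
try:
    forest.predict(X_test_refit)
    refit_error = None
except ValueError as exc:
    refit_error = exc

print(X_train.shape, X_test.shape, X_test_refit.shape)
print(type(refit_error).__name__)
```

The shape mismatch is the lucky case, because it fails loudly. If the refitted vocabulary happened to have the same size, the forest would run without complaint while each column silently referred to a different word than it did during training.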
Now, speaking specifically about CountVectorizer, consider the following example. Let your training data have the following 3 sentences:
The vocabulary for this training set will be {Dog, is, black, sky, blue, dancing}. The Random Forest that you train will try to learn rules based on the counts of these 6 vocabulary terms, so each feature vector will have length 6. Now suppose the test set is as follows:
If you call fit_transform on the test data, your vocabulary will look like {Dog, white, is, Sky, black}. Each document will then be represented by a vector of length 5 holding the counts of each of these terms. This is like comparing apples with oranges: the rules were learned for counts over the previous vocabulary and cannot be applied to this one. This is the reason why you only fit on the training data.
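The example can be reproduced with a short sketch. The sentences below are assumptions: the example sentences are not shown above, so these were chosen only because they yield the two vocabularies quoted in the text. Note also that CountVectorizer lowercases tokens by default, so the learned terms appear in lowercase.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed sentences, chosen to reproduce the vocabularies quoted in the text.
train_sentences = ["Dog is black", "Sky is blue", "Dog is dancing"]
test_sentences = ["Dog is white", "Sky is black"]

vectorizer = CountVectorizer()  # lowercases tokens by default
vectorizer.fit(train_sentences)
train_vocab = sorted(vectorizer.vocabulary_)

# transform keeps the 6 training columns; the unseen word "white" is dropped.
test_ok = vectorizer.transform(test_sentences).toarray()

# fit_transform on the test set builds a separate 5-term vocabulary.
refit = CountVectorizer()
test_refit = refit.fit_transform(test_sentences).toarray()
refit_vocab = sorted(refit.vocabulary_)

print(train_vocab, test_ok.shape)
print(refit_vocab, test_refit.shape)
```

With these sentences, the second column means "blue" under the training vocabulary but "dog" under the refitted one, which is the apples-and-oranges problem in concrete form.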
Basically, you split the data into train and test so that only the train data is exposed to the model and to the calculation of other statistics, such as means and standard deviations. If you expose the test data, your model may no longer generalize and is more likely to overfit. So fit only on the train data with fit_transform, and apply the learned statistics to the test data with transform.
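The same pattern applies to the mean and standard deviation mentioned above. As a hedged illustration, the sketch below swaps in StandardScaler (not part of the original question) with made-up numbers: the scaler learns its statistics from the train data only, and transform reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data for illustration.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

scaler = StandardScaler()

# fit_transform: learn mean and standard deviation from the train data.
X_train_scaled = scaler.fit_transform(X_train)

# transform: reuse the training mean and standard deviation on the test data.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

The test values are scaled with the training mean of 2.0, not with their own mean of 15.0, so no information from the test set leaks into the preprocessing.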
In short, fit is used to train the model; once it is trained, you can use that model, and to use it you call transform. (Remember that fit generally performs the calculations on the data, such as those needed for normalization.)
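A small sketch of that split between learning and applying, using hypothetical documents: calling transform before fit fails because nothing has been learned yet, and fit followed by transform gives the same result as the fit_transform shortcut.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat"]  # hypothetical documents

vectorizer = CountVectorizer()
try:
    vectorizer.transform(docs)  # no vocabulary has been learned yet
    raised = False
except NotFittedError:
    raised = True

vectorizer.fit(docs)  # learn the vocabulary
two_step = vectorizer.transform(docs).toarray()

# The one-call shortcut on a fresh vectorizer gives the same matrix.
one_step = CountVectorizer().fit_transform(docs).toarray()

print(raised)
print(two_step)
```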
So you can call fit and transform on the test data, but it is not a wise decision: you duplicate the effort (your model has already been fitted on the train data), and in the long term it may lower performance too.