
Computing TF-IDF on the whole dataset or only on training data?

2017-12-12 17:34:21 · Tags: python, machine-learning, scikit-learn, nlp, tf-idf

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author uses scikit-learn's fit_transform function during pre-processing to get the tf-idf features of the text for training. The author passes all of the text data to the function before splitting it into train and test sets. Is this correct, or must we split the data first and then call fit_transform on the training set and transform on the test set?

3 Answers

According to the scikit-learn documentation, fit() is used to

Learn vocabulary and idf from training set.

On the other hand, fit_transform() is used to

Learn vocabulary and idf, return term-document matrix.

while transform()

Transforms documents to document-term matrix.

On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially combines both operations). On the testing set, however, you only need to transform() the testing instances (i.e. the documents).
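As a minimal sketch of that pattern, assuming scikit-learn's TfidfVectorizer with default settings and a few toy documents invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for this example (not from the book).
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the bird sat on the fence"]

vectorizer = TfidfVectorizer()

# fit_transform: learn vocabulary and idf from the training documents,
# then encode those same documents.
X_train = vectorizer.fit_transform(train_docs)

# transform: encode the test documents using the vocabulary and idf
# that were learned from the training documents only.
X_test = vectorizer.transform(test_docs)

# Both matrices share the same columns (one per training-set term).
print(X_train.shape, X_test.shape)
```

Words that occur only in the test documents (here "bird" and "fence") get no column, because the vocabulary was fixed during fit.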

Remember that the training set is used for learning (learning is achieved through fit()), while the testing set is used to evaluate whether the trained model generalises well to new, unseen data points.

For more details you can refer to the article fit() vs transform() vs fit_transform().

"The author passes all text data to the function before separating it into train and test. Is this correct, or must we separate the data first and then perform tfidf fit_transform on train and transform on test?"

I would consider this as already leaking some information about the test set into the training set.

I tend to always follow the rule that, before any pre-processing, the first thing to do is separate the data and create a hold-out set.
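One way to see the leak is to compare a vectorizer fitted on everything with one fitted on the training split only. This is a rough sketch: the documents, labels, and the names leaky and clean are hypothetical, and train_test_split is just one way to create the hold-out set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels, invented for illustration.
docs = [
    "cheap pills buy now", "meeting moved to monday", "win a free prize now",
    "lunch on friday", "free pills for a prize", "project review on monday",
]
labels = [1, 0, 1, 0, 1, 0]

# Split first, so the hold-out set plays no part in any fitting step.
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=2, random_state=0
)

leaky = TfidfVectorizer().fit(docs)        # fitted on all data, as in the question
clean = TfidfVectorizer().fit(train_docs)  # fitted on the training split only

# Terms the leaky vectorizer knows only because it saw the hold-out documents.
leaked_words = set(leaky.vocabulary_) - set(clean.vocabulary_)
print(sorted(leaked_words))
```

Even for terms that both vectorizers share, the idf weights can differ, since the leaky one counted document frequencies over the hold-out documents as well.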

As we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set. When we deploy a model in real life, it will encounter words it has never seen before, so we have to validate on the test set with that in mind.
We have to make sure that the new words in the test set are not part of the model's vocabulary.
Hence we have to use fit_transform on the training data and transform on the test data. If you plan to do cross-validation, you can apply the same logic within every fold.
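For cross-validation, one common way to get this behaviour is to put the vectorizer and the classifier in a scikit-learn pipeline, so that the vectorizer is refitted on the training part of each fold. The sketch below assumes a toy corpus and a LogisticRegression classifier, both chosen only for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented for illustration.
docs = [
    "cheap pills buy now", "meeting moved to monday",
    "win a free prize now", "lunch on friday",
    "free pills for a prize", "project review on monday",
    "buy cheap prize now", "friday meeting about the project",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# The pipeline calls fit_transform on each fold's training documents and
# transform on that fold's validation documents.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

scores = cross_val_score(model, docs, labels, cv=4)
print(scores)
```

Passing the raw documents to cross_val_score, rather than a precomputed tf-idf matrix, is what keeps each fold's validation documents out of the vocabulary and idf statistics.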