
Computing TF-IDF on the whole dataset or only on training data?

Original post: 2017-12-12 17:34:21 1 3 python/ machine-learning/ scikit-learn/ nlp/ tf-idf

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author uses scikit-learn's fit_transform function during pre-processing to get the tf-idf features of the text for training. The author passes all of the text data to the function before splitting it into train and test sets. Is this correct, or must we split the data first and then call fit_transform on the training set and transform on the test set?

3 answers

According to the documentation of scikit-learn, fit() is used in order to

Learn vocabulary and idf from training set.

On the other hand, fit_transform() is used in order to

Learn vocabulary and idf, return term-document matrix.

while transform()

Transforms documents to document-term matrix.

On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially combines both operations). On the test set, however, you only need to transform() the test instances (i.e. the documents).
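As a minimal sketch of the pattern described above: the toy corpus, the labels, and the split size below are made up for illustration and are not from the book or the question. The vectorizer is fitted on the training documents only and then reused, unchanged, on the test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "stocks fell sharply today",
    "the market rallied on earnings",
    "investors sold shares",
]
labels = [0, 0, 0, 1, 1, 1]

# Split first, before any pre-processing touches the data.
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=2, random_state=0, stratify=labels
)

vectorizer = TfidfVectorizer()

# fit_transform: learn vocabulary and idf from the training documents,
# then return their document-term matrix.
X_train = vectorizer.fit_transform(train_docs)

# transform: reuse the vocabulary and idf learned above; nothing is
# learned from the test documents.
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
```

Both matrices have the same number of columns, because the columns are defined by the vocabulary learned from the training documents alone.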

Remember that the training set is used for learning (learning is achieved through fit()), while the test set is used to evaluate whether the trained model generalises well to new, unseen data points.

For more details you can refer to the article fit() vs transform() vs fit_transform()

The author gives all text data to the function before separating it into train and test. Is this correct, or must we separate the data first and then perform tf-idf fit_transform on train and transform on test?

I would consider this as already leaking some information about the test set into the training set.
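One way to see the leak, sketched on a made-up three-document corpus (the documents and variable names are assumptions for illustration): fitting on train plus test puts test-only terms into the vocabulary, whereas a vectorizer fitted on the training documents alone simply ignores them at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents; "zebra" appears only in the test document.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the zebra chased the cat"]

# Fitted on the training documents only.
train_only = TfidfVectorizer().fit(train_docs)

# Fitted on everything, as in the approach the question asks about.
all_data = TfidfVectorizer().fit(train_docs + test_docs)

# Terms that entered the vocabulary only because the test set was seen.
leaked_terms = set(all_data.vocabulary_) - set(train_only.vocabulary_)
print(leaked_terms)

# With the train-only vectorizer, the unseen word gets no column:
# only the known words of the test document produce non-zero entries.
X_test = train_only.transform(test_docs)
print(X_test.shape, X_test.nnz)
```

The idf weights also differ between the two vectorizers, since document frequencies are counted over a different set of documents.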

I always follow the rule that, before any pre-processing, the first thing to do is to split the data and create a hold-out set.

Since we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set. When we deploy a model in real life, it will encounter words that it has never seen before, so validation on the test set has to reflect that.
We have to make sure that new words in the test set are not part of the model's vocabulary.
Hence we have to use fit_transform on the training data and transform on the test data. If you use cross-validation, apply the same logic within every fold.
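For cross-validation, one common way to get this behaviour is to put the vectorizer and the classifier into a scikit-learn Pipeline, so that the vectorizer is re-fitted on the training part of each fold. The sketch below assumes a toy corpus and a MultinomialNB classifier, neither of which comes from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "my cat likes the dog",
    "stocks fell sharply today",
    "the market rallied on earnings",
    "investors sold shares today",
    "the market fell on weak earnings",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline fits the vectorizer and the classifier together, so in
# each fold the vocabulary and idf come from that fold's training part.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(model, docs, labels, cv=cv)
print(scores)
```

Calling fit_transform on the whole dataset before cross_val_score would instead let every fold's held-out documents influence the vocabulary and idf.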


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
19 ACCEPTED 2017-12-12 21:03:48

solution2
3 2019-03-22 13:41:58

solution3
1 2019-08-09 21:01:44
