
Which way is right in TF-IDF: fit on all data then transform the train and test sets, or fit on the train set then transform the test set?

1. Fit on the train set, then transform the test set — scikit-learn provides this example:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

2. Fit on all the data, then transform the train set and test set — which I've seen in many cases:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_all = np.append(train_x, test_x, axis=0)
vectorizer.fit(X_all)
X_train = vectorizer.transform(train_x)
X_test = vectorizer.transform(test_x)

So I'm confused: which way is right, and why?

It really depends on your use case.

In the first situation, your test-set TF-IDF values are based only on the frequencies in the train set. This lets you control the "reference" corpus and decouples your results from the test data, which makes sense when the test set is sampled from a distribution very different from what you would expect in a normal situation. Note that this only works because scikit-learn implements TF-IDF in a way that is robust to previously unseen words: tokens absent from the fitted vocabulary are simply ignored at transform time.
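A minimal sketch of that unseen-word behavior (the documents here are made up for illustration): words in the test set that were never seen during fit are dropped rather than causing an error.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpora for illustration.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the unicorn sparkled"]  # "unicorn" and "sparkled" are unseen

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)  # vocabulary is built from train_docs only

X_test = vectorizer.transform(test_docs)
# Only "the" is in the fitted vocabulary, so the test row has exactly
# one non-zero entry; the unseen words are silently ignored.
print(X_test.nnz)  # 1
```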

In the second situation, when you fit the vectorizer on the test set as well, your frequencies are also based on what is in the test set. This gives more representative frequency values for data in your test-set domain, which can improve performance on your downstream task, and it also ensures no unseen words appear at transform time.

tl;dr both work
