
TF-IDF Vectors Example (HELP)

Hey, I tried 3 different approaches, but I can't decide which is the right way to use TF-IDF:

The first code calls fit_transform on x_train and x_test separately, giving shapes (5000, 94462) and (5000, 93007).

The second code fits on train and test combined, which I think is not right because the idf should be calculated from the training documents only. It gives (5000, 152800) and (5000, 152800).

The third code gives (5000, 94462) and (5000, 94462).

To me the third code looks right, because I fit on the training data only and transform the test data with that fitted vectorizer.

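# Approach 1: fit_transform on train and test separately
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
# Refits on x_test, so the vocabulary (and column count) changes
xtest_tfidf = vectorizer.fit_transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)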

# Approach 2: fit on train and test combined (assumes both are Python lists)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train + x_test)
xtrain_tfidf = vectorizer.transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)

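# Approach 3: fit on the training data only, then transform both sets
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(x_train)
x_train_vectorized = vect.transform(x_train)
x_test_vectorized = vect.transform(x_test)
print(x_train_vectorized.shape)
print(x_test_vectorized.shape)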

The right way is to call fit_transform (fit followed by transform) on your training data and only transform on your test data, so that the test set is encoded with the vocabulary and idf weights learned from the training set.


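from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Learn the vocabulary and idf from the training data, then encode it
xtrain_tfidf = vectorizer.fit_transform(x_train)
# Encode the test data with the already fitted vectorizer
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)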
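
To see why the column counts in the question differ, here is a minimal sketch on a toy corpus. The four sentences are made-up stand-ins for x_train and x_test, not the asker's data, and refit_tfidf is a name introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for x_train and x_test (hypothetical data).
x_train = ["the cat sat on the mat", "the dog chased the cat"]
x_test = ["a bird sat on the fence", "the dog barked loudly"]

# Learn the vocabulary and idf from the training set only.
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)

# Both matrices use the training vocabulary, so the column counts match.
print(xtrain_tfidf.shape, xtest_tfidf.shape)

# Refitting on the test set builds a different vocabulary,
# so the columns no longer line up with the training matrix.
refit_tfidf = TfidfVectorizer().fit_transform(x_test)
print(refit_tfidf.shape)
```

A model trained on xtrain_tfidf expects exactly those columns, in that order, which is what transform preserves.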

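You never call fit_transform on test data: refitting would rebuild the vocabulary and idf weights from the test set, so its columns would no longer match the ones learned from the training set.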
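
The fit/transform split can also be sketched in plain Python to show what "fitting" actually stores. This is a simplified illustration, not scikit-learn's implementation: it assumes whitespace tokenization, skips the L2 normalization step, and uses the smoothed idf formula ln((1 + n) / (1 + df)) + 1. The helper names fit_idf and transform are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed idf weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Weight term counts by the learned idf; terms unseen in training are dropped."""
    counts = Counter(doc.lower().split())
    return {term: count * idf[term]
            for term, count in counts.items() if term in idf}


train_docs = ["the cat sat", "the dog ran"]
idf = fit_idf(train_docs)

# "bird" never appeared in training, so it has no idf weight and is ignored.
test_vector = transform("the bird sat", idf)
print(test_vector)
```

Because the idf dictionary is built once from the training documents, every later document is mapped into the same feature space, which is the behavior fit followed by transform gives you.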
