TF-IDF Vectors Example (HELP)
Hey, I tried 3 different approaches but I can't decide which is the right way to use TF-IDF:
The first code calls fit_transform on x_train and x_test separately, giving (5000, 94462) (5000, 93007).
The second code fits on train and test together, which I think is not right because idf should be calculated from the training documents only, giving (5000, 152800) (5000, 152800).
The third code fits on x_train only, giving (5000, 94462) (5000, 94462).
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.fit_transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(x_train+x_test)
xtrain_tfidf = vectorizer.transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect.fit(x_train)
x_test_vectorized = vect.transform(x_test)
The right way is to fit and transform (== fit_transform) your training data and only transform your test data.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)
print(xtrain_tfidf.shape)
print(xtest_tfidf.shape)
You never fit_transform test data.
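To see why, here is a minimal sketch on a toy corpus. The two small lists below are made-up stand-ins for x_train and x_test (they are not the asker's data), and the variable names wrong, wrong_train and wrong_test are only for illustration. Calling fit_transform a second time relearns the vocabulary and idf weights from the test documents, so the columns of the two matrices no longer refer to the same words, and a model trained on the first matrix cannot be applied to the second.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for x_train and x_test.
x_train = ["the cat sat on the mat", "the dog chased the cat"]
x_test = ["the bird sat", "a dog barked loudly"]

# Wrong: the second fit_transform discards what was learned from x_train
# and builds a new vocabulary and new idf weights from x_test.
wrong = TfidfVectorizer()
wrong_train = wrong.fit_transform(x_train)
wrong_test = wrong.fit_transform(x_test)

# Right: learn vocabulary and idf from the training data only,
# then reuse them unchanged on the test data.
vectorizer = TfidfVectorizer()
xtrain_tfidf = vectorizer.fit_transform(x_train)
xtest_tfidf = vectorizer.transform(x_test)

# The "wrong" pair has mismatched column counts; the "right" pair matches.
print(wrong_train.shape, wrong_test.shape)
print(xtrain_tfidf.shape, xtest_tfidf.shape)

# Words that appear only in the test set are simply ignored by transform.
print("bird" in vectorizer.vocabulary_)
```

With the correct version, both matrices share the columns defined by the training vocabulary, and test-only words such as "bird" are dropped rather than added as new features.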