TF-IDF vectorizer for a multi-label classification problem

Question

I have a multi-label classification project for a large number of texts. I used the TF-IDF vectorizer on the texts (train_v['doc_text']) as follows:

tfidf_transformer = TfidfTransformer()
X_counts = count_vect.fit_transform(train_v['doc_text']) 
X_tfidf = tfidf_transformer.fit_transform(X_counts) 
x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf_r, label_vs, test_size=0.33, random_state=9000)
sgd = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=25, tol=None, fit_intercept=True, alpha=0.000009)

Now I need to use the same vectorizer on a set of features (test_v['doc_text']) to predict the labels. However, when I use the following:

X_counts_test = count_vect.fit_transform(test_v['doc_text']) 
X_tfidf_test = tfidf_transformer.fit_transform(X_counts_test) 
predictions_test = clf.predict(X_tfidf_test)

I get an error message:

ValueError: X has 388894 features per sample; expecting 330204

Any idea how to deal with this?

Thanks.

Answer 1

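The problem is that you are calling fit_transform on the test data, which refits count_vect and the TfidfTransformer on the test data before transforming it. The refit learns a new vocabulary from the test texts, so the number of features (388894) no longer matches the number the classifier was trained on (330204).

Use the transform method on the test data instead.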
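To illustrate the difference between the two calls, here is a minimal sketch on a toy corpus. The documents and variable names are made up for the example and are not from the question; only the scikit-learn calls are real.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for train_v['doc_text'] and test_v['doc_text'].
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["a bird sang loudly in the tree", "the cat sang"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary from the training texts
X_test = vectorizer.transform(test_docs)        # reuses that vocabulary; unseen words are ignored

# Refitting on the test texts learns a different vocabulary,
# so the number of columns no longer matches the training matrix.
X_refit = TfidfVectorizer().fit_transform(test_docs)

print(X_train.shape[1], X_test.shape[1], X_refit.shape[1])
```

The first two widths agree because transform maps the test texts onto the training vocabulary. The third differs, which is the same kind of mismatch the ValueError in the question reports.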

Also, you should use TfidfVectorizer, which does the counting and the TF-IDF weighting in a single step.

In my opinion the code should be:
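As a quick check of that suggestion, the sketch below compares the two-step route (CountVectorizer followed by TfidfTransformer) with TfidfVectorizer on a toy corpus, using default settings for all three classes. The documents are invented for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical documents, used only to compare the two routes.
docs = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]

# Two-step route: raw token counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both.
one_step = TfidfVectorizer().fit_transform(docs)

# With default parameters the two matrices should match.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

With only one object to fit, there is one fewer place to call fit_transform by mistake on the test data.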

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# TfidfVectorizer replaces the separate count_vect.fit_transform(...) step.
tfidf_transformer = TfidfVectorizer()
X_tfidf = tfidf_transformer.fit_transform(train_v['doc_text'])
x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf, label_vs, test_size=0.33, random_state=9000)
sgd = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=25, tol=None, fit_intercept=True, alpha=0.000009)

# Call transform, not fit_transform, so the test texts are mapped onto
# the vocabulary learned from the training texts.
X_tfidf_test = tfidf_transformer.transform(test_v['doc_text'])
# clf is the classifier from the question, already fitted on the training features.
predictions_test = clf.predict(X_tfidf_test)

Also, why are you using count_vect? I don't think it is needed here. And in train_test_split you pass X_tfidf_r, which is not defined anywhere.
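One way to make this mistake harder to repeat, which the original code does not use, is to wrap the vectorizer and the classifier in a scikit-learn Pipeline. The sketch below is only an illustration: the texts and labels are hypothetical, and it uses single-label toy data for brevity rather than the multi-label target from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training and test texts.
train_docs = [
    "cheap pills online",
    "meeting at noon",
    "buy cheap watches",
    "lunch meeting today",
]
train_labels = [1, 0, 1, 0]
test_docs = ["cheap watches online", "noon lunch"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", random_state=42,
                          max_iter=25, tol=None)),
])

# fit() calls fit_transform on the vectorizer, then fits the classifier.
model.fit(train_docs, train_labels)

# predict() only calls transform on the vectorizer, so the feature
# count always matches what the classifier was trained on.
predictions = model.predict(test_docs)
print(predictions)
```

Because the pipeline decides when to fit and when to transform, the raw test texts can be passed straight to predict.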
