TfidfVectorizer: predicting with a saved classifier

Question

I trained my model using TfidfVectorizer and MultinomialNB and saved it to a pickle file.

Now I am trying to use the classifier from another file to predict on unseen data, but I cannot: it tells me that the number of features the classifier expects is not the same as the number of features in my current corpus.

This is the code where I am trying to predict. The function do_vectorize is exactly the same one used in training.

def do_vectorize(data, stop_words=[], tokenizer_fn=tokenize):
    vectorizer = TfidfVectorizer(stop_words=stop_words, tokenizer=tokenizer_fn)
    X = vectorizer.fit_transform(data)
    return X, vectorizer

# Vectorizing the unseen documents 
matrix, vectorizer = do_vectorize(corpus, stop_words=stop_words)

# Predicting on the trained model
clf = pickle.load(open('../data/classifier_0.5_function.pkl', 'rb'))
predictions = clf.predict(matrix)

However, I receive an error saying that the number of features is different:

ValueError: Expected input with 65264 features, got 472546 instead

Does this mean I also have to save my vocabulary from training in order to test? What will happen if there are terms that did not exist in training?

I tried to use pipelines from scikit-learn with the same vectorizer and classifier, and the same parameters for both. However, it became too slow, going from 1 hour to more than 6 hours, so I prefer to do it manually.

Answer 1

Does this mean I also have to save my vocabulary from training in order to test?

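Yes, you have to save the whole fitted TfidfVectorizer, which in particular means saving the vocabulary. At prediction time, call transform on the unseen documents with that saved vectorizer instead of fit_transform. Fitting again builds a new vocabulary from the new corpus, which is why the classifier expects 65264 features but receives 472546.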
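A minimal sketch of that workflow, assuming a toy corpus and labels invented here for illustration (they are not from the question). The vectorizer and classifier are pickled together, and the prediction side only ever calls transform. In real code you would likely write the bytes to a file with pickle.dump inside a with open(...) block; pickle.dumps is used here only to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches",
    "monday project meeting notes",
]
train_labels = ["spam", "ham", "spam", "ham"]

# Training side: fit the vectorizer once, on the training corpus only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train, train_labels)

# Persist both objects together so they cannot get out of sync.
blob = pickle.dumps({"vectorizer": vectorizer, "clf": clf})

# Prediction side (possibly another file): load, then transform, never fit.
saved = pickle.loads(blob)
unseen_docs = ["buy pills now", "notes from the meeting"]
X_unseen = saved["vectorizer"].transform(unseen_docs)
predictions = saved["clf"].predict(X_unseen)

# The unseen matrix has exactly as many columns as the training matrix.
print(X_train.shape[1], X_unseen.shape[1], list(predictions))
```

Because the same fitted vectorizer produces both matrices, the column count seen by predict matches what the classifier was trained on.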

What will happen if there are terms that did not exist in training?

They will be ignored, which makes perfect sense: you have no training data about them, so there is nothing to take into consideration (there are more complex methods that could still use such terms, but not with an approach as simple as tf-idf).
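The behaviour can be checked with a small sketch; the sentences are made up for illustration. Terms that were not seen during fit have no column in the matrix, so a document containing them is vectorized as if they were absent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a tiny hypothetical corpus with five distinct terms.
vectorizer = TfidfVectorizer()
vectorizer.fit(["the cat sat", "the dog barked"])
n_features = len(vectorizer.vocabulary_)

# "zebra" and "galloped" never appeared during fit, so they have no column.
known = vectorizer.transform(["the cat"])
mixed = vectorizer.transform(["the cat zebra galloped"])
only_new = vectorizer.transform(["zebra galloped"])

# Same width in every case; unknown terms contribute nothing.
print(n_features, known.shape, mixed.shape, only_new.nnz)
```

Here known and mixed come out as the same vector, and only_new is an all-zero row, so the classifier would fall back on its class priors for such a document.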

I tried to use pipelines from scikit-learn with the same vectorizer and classifier, and the same parameters for both. However, it became too slow, going from 1 hour to more than 6 hours, so I prefer to do it manually.

There should be little to no overhead when using pipelines; however, doing things manually is fine as long as you remember to store the vectorizer as well.
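For comparison, a sketch of the pipeline variant, again with hypothetical toy data and step names chosen here ("tfidf", "nb"). The pipeline is a single object, so pickling it stores the fitted vectorizer and the classifier in one step, and predict accepts raw text.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches",
    "monday project meeting notes",
]
train_labels = ["spam", "ham", "spam", "ham"]

# One object holds both steps; fit runs fit_transform, then trains the classifier.
model = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
model.fit(train_docs, train_labels)

# Round-trip through pickle, then predict directly on raw text.
restored = pickle.loads(pickle.dumps(model))
unseen_docs = ["buy cheap pills", "project meeting on monday"]
predictions = restored.predict(unseen_docs)
print(list(predictions))
```

The pipeline does the same work as the manual version, which is why a large slowdown would more likely come from something else in the setup than from the pipeline itself.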

Answer 2

You have to set a maximum feature limit when initializing the tf-idf vectorizer, like this:

tfidf_vectorizer = TfidfVectorizer(max_features = 1200)

and then use the same feature limit to convert the test data into tf-idf.
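One thing worth checking with this approach, sketched below on made-up documents: a shared max_features makes the matrix widths agree, but a vectorizer that is fit again on the test data picks its own terms, so its columns may not correspond to the columns the classifier was trained on. Reusing the vectorizer fitted on the training data keeps the columns aligned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpora with no terms in common.
train_docs = ["apple banana apple", "banana cherry banana"]
test_docs = ["xylophone yak xylophone", "yak zebra yak"]

# Same limit, but fitted on different corpora.
train_vec = TfidfVectorizer(max_features=2).fit(train_docs)
refit_vec = TfidfVectorizer(max_features=2).fit(test_docs)

# Both keep their two most frequent terms, which are different words.
train_terms = sorted(train_vec.vocabulary_)
refit_terms = sorted(refit_vec.vocabulary_)

# Transforming the test data with the training vectorizer keeps the
# training columns; here none of the test terms are known to it.
X_test = train_vec.transform(test_docs)
print(train_terms, refit_terms, X_test.shape, X_test.nnz)
```

In this sketch the two vectorizers have the same width but no terms in common, so feeding the refit matrix to the classifier would run without an error while the columns mean something different.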

Answer 1: score 3, accepted, 2016-10-10 19:01:36

Answer 2: score 0, 2022-10-03 08:47:34