TfIdf matrix returns wrong number of features for BernoulliNB

Using the Python library sklearn, I tried to extract features from a training set and fit a BernoulliNB classifier with this data.

After the classifier is trained, I want to predict (classify) some new test data. Unfortunately I get this error:

Traceback (most recent call last):
File "sentiment_analysis.py", line 45, in <module> main()
File "sentiment_analysis.py", line 41, in main
  prediction = classifier.predict(tfidf_data)
File "\Python27\lib\site-packages\sklearn\naive_bayes.py", line 64, in predict
  jll = self._joint_log_likelihood(X)
File "\Python27\lib\site-packages\sklearn\naive_bayes.py", line 724, in _joint_log_likelihood
  % (n_features, n_features_X))
ValueError: Expected input with 4773 features, got 13006 instead

This is my code:

#Train the Classifier
data,target = load_file('validation/validation_set_5.csv')
tf_idf = preprocess(data)
classifier = BernoulliNB().fit(tf_idf, target)

#Predict test data
count_vectorizer = CountVectorizer(binary='true')
test = count_vectorizer.fit_transform(test)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(test)
prediction = classifier.predict(tfidf_data)

These two lines are why you are getting the error:

test = count_vectorizer.fit_transform(test)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(test)

Here you should reuse the transformers that were already fitted on the training set (the CountVectorizer and the TfidfTransformer are your transformers) instead of creating and fitting new ones.

fit_transform means that you fit these transformers on the new set, losing all information from the old fit, and then transform 'test' with the newly fitted transformer (learned on the new samples, with a different set of features). It therefore returns the test set mapped to a new feature set that is incompatible with the feature set used for training, which is why the classifier expects 4773 features but receives 13006. To fix this, call transform (not fit_transform) on the old transformers that were fitted on the training set.
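The effect can be reproduced on a toy example. The sketch below is only an illustration: the two small corpora and the variable names are made up, and only the CountVectorizer step is shown. It compares the feature count after refitting on the test data with the feature count after reusing the fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, standing in for the real training and test data.
train_docs = ["good movie", "bad movie", "great film"]
test_docs = ["terrible plot", "good plot and good acting with awful sound"]

# Fit on the training data: the vocabulary (feature set) is learned here.
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_docs)

# Wrong: a new vectorizer fitted on the test data learns a different vocabulary.
refit_vectorizer = CountVectorizer(binary=True)
X_test_wrong = refit_vectorizer.fit_transform(test_docs)

# Right: the already fitted vectorizer maps test data onto the training vocabulary.
X_test_right = vectorizer.transform(test_docs)

print("train features:        ", X_train.shape[1])
print("test features (refit): ", X_test_wrong.shape[1])
print("test features (reused):", X_test_right.shape[1])
```

Words that appear only in the test set are simply ignored by transform, so the number of columns always matches what the classifier saw during training.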

You should write something like:

test = old_count_vectorizer.transform(test)
tfidf_data = old_tfidf_transformer.transform(test)
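For completeness, here is a minimal end-to-end sketch of the same idea. It assumes that the question's preprocess function does roughly what is shown in the training part (its real contents are not given), and the documents and labels are invented for illustration; only the names old_count_vectorizer and old_tfidf_transformer are taken from the snippet above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical data; the real code loads it from a CSV file.
train_docs = ["good movie", "bad movie", "great film", "awful film"]
train_labels = ["pos", "neg", "pos", "neg"]
test_docs = ["good film", "awful plot"]

# Training: fit both transformers once and keep references to them.
old_count_vectorizer = CountVectorizer(binary=True)
old_tfidf_transformer = TfidfTransformer(use_idf=False)
train_counts = old_count_vectorizer.fit_transform(train_docs)
tf_idf = old_tfidf_transformer.fit_transform(train_counts)
classifier = BernoulliNB().fit(tf_idf, train_labels)

# Prediction: only transform, never fit, so the feature set stays the same.
test_counts = old_count_vectorizer.transform(test_docs)
tfidf_data = old_tfidf_transformer.transform(test_counts)
prediction = classifier.predict(tfidf_data)

print(tfidf_data.shape[1] == tf_idf.shape[1])
print(prediction)
```

If keeping the fitted objects around by hand is inconvenient, sklearn.pipeline.Pipeline can bundle the vectorizer, the transformer, and the classifier into one object, so that fit and predict apply the right calls to each step.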
