
TfIdf matrix returns wrong number of features for BernoulliNB

Using the Python library sklearn, I tried to extract features from a training set and fit a BernoulliNB classifier with this data.

After the classifier is trained, I want to predict (classify) some new test data. Unfortunately, I get this error:

Traceback (most recent call last):
  File "sentiment_analysis.py", line 45, in <module>
    main()
  File "sentiment_analysis.py", line 41, in main
    prediction = classifier.predict(tfidf_data)
  File "\Python27\lib\site-packages\sklearn\naive_bayes.py", line 64, in predict
    jll = self._joint_log_likelihood(X)
  File "\Python27\lib\site-packages\sklearn\naive_bayes.py", line 724, in _joint_log_likelihood
    % (n_features, n_features_X))
ValueError: Expected input with 4773 features, got 13006 instead

This is my code:

#Train the Classifier
data,target = load_file('validation/validation_set_5.csv')
tf_idf = preprocess(data)
classifier = BernoulliNB().fit(tf_idf, target)

#Predict test data
count_vectorizer = CountVectorizer(binary=True)
test = count_vectorizer.fit_transform(test)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(test)
prediction = classifier.predict(tfidf_data)

These lines are why you are getting the error:

test = count_vectorizer.fit_transform(test)
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(test)

Here you should use only the old transformers (the CountVectorizer and TfidfTransformer instances) that were already fitted on the training set.

fit_transform means that you fit these transformers on the new set, losing all information from the earlier fit, and then transform 'test' with the newly fitted transformer (learned on the new samples, with a different set of features). It therefore returns the test set mapped onto a new feature set that is incompatible with the feature set used for training. To fix this, call transform (not fit_transform) on the old transformers that were fitted on the training set.
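The mismatch can be reproduced on a tiny scale. The following is only an illustrative sketch: the documents and the variable names (train_docs, test_docs, refit_vectorizer) are made up for the example and are not taken from the question's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not the data from the question.
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["bad acting"]

# Fitting on the training set learns a 5-word vocabulary:
# bad, good, great, movie, plot.
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_docs)

# Wrong: a fresh fit on the test set learns its own 2-word vocabulary
# (acting, bad), so the column count no longer matches the training matrix.
refit_vectorizer = CountVectorizer(binary=True)
X_test_wrong = refit_vectorizer.fit_transform(test_docs)

# Right: transform() reuses the training vocabulary; words never seen
# during training ("acting") are simply ignored.
X_test_right = vectorizer.transform(test_docs)

print(X_train.shape, X_test_wrong.shape, X_test_right.shape)
```

A classifier trained on X_train expects the column count of X_train, which is why only the transform() result can be passed to predict.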

You should write something like:

test = old_count_vectorizer.transform(test)
tfidf_data = old_tfidf_transformer.transform(test)
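For completeness, here is a minimal end-to-end sketch of that pattern. It assumes the vectorizer and transformer created during training are kept around; in the question they are presumably created inside preprocess(), so that function would need to return them. The toy texts and labels are hypothetical stand-ins for the CSV data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data standing in for the contents of the CSV files.
train_texts = ["good movie", "great plot", "bad movie", "awful plot"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["good plot", "awful movie"]

# Training: fit both transformers once and keep references to them.
old_count_vectorizer = CountVectorizer(binary=True)
old_tfidf_transformer = TfidfTransformer(use_idf=False)
train_counts = old_count_vectorizer.fit_transform(train_texts)
train_tfidf = old_tfidf_transformer.fit_transform(train_counts)
classifier = BernoulliNB().fit(train_tfidf, train_labels)

# Prediction: only transform(), so the test matrix has the same
# columns (features) as the matrix the classifier was trained on.
test_counts = old_count_vectorizer.transform(test_texts)
test_tfidf = old_tfidf_transformer.transform(test_counts)
prediction = classifier.predict(test_tfidf)
print(prediction)
```

Another option is to wrap the three steps in sklearn.pipeline.Pipeline, so that fit() and predict() apply the right method to each step automatically.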
