
How to force scikit-learn DictVectorizer not to discard features?

I'm trying to use scikit-learn for a classification task. My code extracts features from the data and stores them in a dictionary like so:

feature_dict['feature_name_1'] = feature_1
feature_dict['feature_name_2'] = feature_2

When I split the data for testing using sklearn.cross_validation, everything works as it should. The problem I'm having is when the test data is a new set, not part of the learning set (although it has exactly the same features for each sample). After I fit the classifier on the learning set and then call clf.predict, I get this error:

ValueError: X has different number of features than during model fitting.

I am assuming this has to do with the following, from the DictVectorizer docs:

Named features not encountered during fit or fit_transform will be silently ignored.

DictVectorizer has removed some of the features, I guess... How do I disable or work around this behavior?
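To make the setup concrete, here is a simplified sketch of the pattern I had (the feature dicts, labels and LinearSVC classifier are made up for illustration, and the test dicts deliberately don't cover every feature name so the mismatch actually shows up):

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

train_dicts = [{'feature_name_1': 1.0, 'feature_name_2': 2.0},
               {'feature_name_1': 0.5, 'feature_name_3': 1.0}]
test_dicts = [{'feature_name_1': 0.3, 'feature_name_2': 1.5}]

# One vectorizer fitted on the training dicts...
X_train = DictVectorizer(sparse=False).fit_transform(train_dicts)
clf = LinearSVC().fit(X_train, [0, 1])

# ...and a second fit on the test dicts learns a different feature mapping,
# so the column count no longer matches what clf was trained on.
X_test = DictVectorizer(sparse=False).fit_transform(test_dicts)
predicted = clf.predict(X_test)  # raises the ValueError about feature count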

Thanks

=== EDIT ===

The problem was, as larsMans suggested, that I was fitting the DictVectorizer twice.

You should use fit_transform on the training set, and only transform on the test set.
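A minimal sketch of that pattern (the data, labels and the LinearSVC classifier are placeholders, not taken from the original code):

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

train_dicts = [{'feature_name_1': 1.0, 'feature_name_2': 2.0},
               {'feature_name_1': 0.5, 'feature_name_3': 1.0}]
train_labels = [0, 1]
test_dicts = [{'feature_name_1': 0.3, 'feature_name_2': 1.5}]

vec = DictVectorizer()

# fit_transform on the training data learns the feature-name -> column mapping
X_train = vec.fit_transform(train_dicts)
clf = LinearSVC().fit(X_train, train_labels)

# transform (not fit_transform) on the test data reuses that mapping, so
# clf.predict sees the same column layout it was trained on
X_test = vec.transform(test_dicts)
predicted = clf.predict(X_test)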

Are you making sure to call the previously built scaler and selector transforms on the test data?

from sklearn import preprocessing
from sklearn.feature_selection import SelectPercentile, f_classif

# Fit the scaler and the feature selector on the training data only
scaler = preprocessing.StandardScaler().fit(trainingData)
selector = SelectPercentile(f_classif, percentile=90)
selector.fit(scaler.transform(trainingData), labelsTrain)
...
...
# Apply the same fitted transforms to the test data before predicting
predicted = clf.predict(selector.transform(scaler.transform(testingData)))
