
scikit-learn: Identifying the corresponding feature-id values when using SelectKBest

I am using scikit-learn (version 0.11 with Python version 2.7.3) to select the top K features from a binary classification dataset in svmlight format.

I am trying to identify the feature-id values of the selected features. I assumed this would be quite simple, and it may well be! (By feature-id, I mean the number before the colon in each feature:value pair of the svmlight format, e.g. the 4 in 4:1.000000.)

The following code illustrates exactly how I have been trying to do this:

from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

svmlight_format_train_file = 'contrived_svmlight_train_file.txt' #I present the contents of this file below

X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)

featureSelector = SelectKBest(score_func=chi2,k=2)

featureSelector.fit(X_train_data,Y_train_data)

assumed_to_be_the_feature_ids_of_the_top_k_features = list(featureSelector.get_support(indices=True)) #indices=False just gives me a list of True,False etc...

print(assumed_to_be_the_feature_ids_of_the_top_k_features) #this gives: [0, 2]

Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values - since (see below) the feature-id values in my input file start from 1.

Now, I suspect that assumed_to_be_the_feature_ids_of_the_top_k_features may, in fact, correspond to the list indices of the feature-id values sorted in order of increasing value. In my case, index 0 would correspond to feature-id=1 etc. - such that the code is telling me that feature-id=1 and feature-id=3 were selected.

I'd be grateful if someone could either confirm or deny this, however.

Thanks in advance.

Contents of contrived_svmlight_train_file.txt :

1 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1 1:1.000000 2:1.000000#mB
0 5:1.000000#mC
1 1:1.000000 2:1.000000#mD
0 3:1.000000 4:1.000000#mE
0 3:1.000000#mF
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
0 2:1.000000#mH

PS Apologies for not formatting correctly (first time here); I hope this is legible and comprehensible!

Quoting the question: "Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values - since (see below) the feature-id values in my input file start from 1."

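Actually, they do correspond, just shifted by one. The SVMlight format loader detects that your input file uses one-based indices and subtracts one from every index so as not to waste a column. Column index i in X_train_data therefore holds feature-id i + 1, so [0, 2] means feature-id=1 and feature-id=3, as you suspected. If that's not what you want, pass zero_based=True to load_svmlight_file so that it treats the file as zero-based and keeps the indices unchanged, which leaves an empty extra column 0; see its documentation for details.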
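
To make the index shift concrete, here is a minimal sketch of both loading modes. It assumes a recent scikit-learn on Python 3 rather than the 0.11 / Python 2.7 setup from the question, reads the contrived data from an in-memory buffer instead of a file, and omits the trailing #mA-style comments; the variable names are made up for illustration.

```python
from io import BytesIO

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

# Contrived data from the question (feature-ids run from 1 to 6).
SVMLIGHT_BYTES = b"""\
1 1:1.000000 2:1.000000 4:1.000000 6:1.000000
1 1:1.000000 2:1.000000
0 5:1.000000
1 1:1.000000 2:1.000000
0 3:1.000000 4:1.000000
0 3:1.000000
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000
0 2:1.000000
"""

# Default (zero_based="auto"): no index 0 appears in the file, so the loader
# treats it as one-based and shifts every feature-id down by one.
X_auto, y = load_svmlight_file(BytesIO(SVMLIGHT_BYTES))

selector = SelectKBest(score_func=chi2, k=2).fit(X_auto, y)
indices = [int(i) for i in selector.get_support(indices=True)]

# Column index i holds feature-id i + 1, so add one to recover the ids.
feature_ids = [i + 1 for i in indices]
print("column indices:", indices)
print("feature-ids:   ", feature_ids)

# zero_based=True: indices are kept as written, so column j holds
# feature-id j and column 0 is an extra, all-zero column.
X_zero, _ = load_svmlight_file(BytesIO(SVMLIGHT_BYTES), zero_based=True)
print("shapes:", X_auto.shape, X_zero.shape)
```

With the default loading, adding one to each selected index recovers the feature-id. With zero_based=True the indices could be read as feature-ids directly, at the cost of carrying an empty column 0 through the rest of the pipeline.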
