scikit-learn: Identifying the corresponding feature-id values when using SelectKBest

Question

I am using scikit-learn (version 0.11 with Python version 2.7.3) to select the top K features from a binary classification dataset in svmlight format.

I am trying to identify the feature-id values of the selected features. I assumed this would be quite simple - and it may well be! (By feature-id, I mean the number that precedes each feature value in the svmlight format, e.g. the 4 in 4:1.000000.)

The following code illustrates exactly how I have been trying to do this:

from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

svmlight_format_train_file = 'contrived_svmlight_train_file.txt'  # the contents of this file are shown below

X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)

featureSelector = SelectKBest(score_func=chi2, k=2)

featureSelector.fit(X_train_data, Y_train_data)

# indices=False just gives a list of True, False, etc.
assumed_to_be_the_feature_ids_of_the_top_k_features = list(featureSelector.get_support(indices=True))

print(assumed_to_be_the_feature_ids_of_the_top_k_features)  # this gives: [0, 2]

Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values - since (see below) the feature-id values in my input file start from 1.

Now, I suspect that assumed_to_be_the_feature_ids_of_the_top_k_features may, in fact, correspond to the list indices of the feature-id values sorted in increasing order. In my case, index 0 would correspond to feature-id=1 and so on, so that the code is telling me that feature-id=1 and feature-id=3 were selected.

I'd be grateful if someone could either confirm or deny this, however.

Thanks in advance.

Contents of contrived_svmlight_train_file.txt:

1 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1 1:1.000000 2:1.000000#mB
0 5:1.000000#mC
1 1:1.000000 2:1.000000#mD
0 3:1.000000 4:1.000000#mE
0 3:1.000000#mF
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
0 2:1.000000#mH

PS: Apologies for not formatting correctly (first time here); I hope this is legible and comprehensible!

Answer 1

"Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values - since (see below) the feature-id values in my input file start from 1."

Actually, they do. The SVMlight format loader will detect that your input file has one-based indices and will subtract one from every index so as not to waste a column. If that's not what you want, then pass zero_based=True to load_svmlight_file to pretend that the file is actually zero-based and insert an extra column; see its documentation for details.
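To make the index shift concrete, here is a minimal sketch rather than a definitive recipe. It assumes a recent scikit-learn in which load_svmlight_file accepts a binary file-like object and defaults to zero_based="auto"; neither was checked against version 0.11. The data is the contrived file from the question with the trailing #m... comments left out, and the names SVMLIGHT_BYTES, X_auto, X_padded, selected_columns and feature_ids are made up for this illustration.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

# The question's contrived training file (feature-ids start at 1);
# the trailing "#mA"-style comments are omitted here.
SVMLIGHT_BYTES = b"""\
1 1:1.000000 2:1.000000 4:1.000000 6:1.000000
1 1:1.000000 2:1.000000
0 5:1.000000
1 1:1.000000 2:1.000000
0 3:1.000000 4:1.000000
0 3:1.000000
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000
0 2:1.000000
"""

# Default behaviour (zero_based="auto"): the loader sees that no index is 0
# and shifts every index down by one, so feature-id 1 lands in column 0.
X_auto, y = load_svmlight_file(BytesIO(SVMLIGHT_BYTES))

# zero_based=True: indices are taken literally, so column 0 is an empty
# padding column and column i holds feature-id i.
X_padded, _ = load_svmlight_file(BytesIO(SVMLIGHT_BYTES), zero_based=True)

selector = SelectKBest(score_func=chi2, k=2).fit(X_auto, y)
selected_columns = selector.get_support(indices=True)

# With the shifted matrix, add one to get back to the ids used in the file.
feature_ids = [int(col) + 1 for col in selected_columns]

print(X_auto.shape, X_padded.shape)
print(list(selected_columns), feature_ids)
```

Which two columns are reported can depend on how a given scikit-learn version breaks ties between equally scored features, so the printed selection may not match the [0, 2] shown in the question; the column-to-feature-id mapping is the same either way.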

Answer 1: accepted (score 2), 2012-10-10 23:49:35
