scikit-learn: identifying the corresponding feature-id values when using SelectKBest

Question

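I am using scikit-learn (version 0.11, with Python 2.7.3) to select the top K features from a binary classification dataset in svmlight format.

I am trying to identify the feature-id values of the selected features. I assumed this would be simple, and it most probably is! (By feature-id, I mean the number before the feature value, as described here.)

The following code illustrates how I have been trying to do this: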

from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2  # chi2 is the scoring function

svmlight_format_train_file = 'contrived_svmlight_train_file.txt'  # I present the contents of this file below

# Load the sparse feature matrix and the label vector
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)

# Keep the 2 features with the highest chi-squared score
featureSelector = SelectKBest(score_func=chi2, k=2)
featureSelector.fit(X_train_data, Y_train_data)

# indices=False just gives me a list of True, False, etc.
assumed_to_be_the_feature_ids_of_the_top_k_features = list(featureSelector.get_support(indices=True))

print(assumed_to_be_the_feature_ids_of_the_top_k_features)  # this gives: [0, 2]

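Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values, since (see below) the feature-id values in my input file start from 1.

Now, I suspect that assumed_to_be_the_feature_ids_of_the_top_k_features may actually correspond to the list indices of the feature-id values sorted in increasing order. In my case, index 0 would correspond to feature-id=1, and so on, so the code is telling me that feature-id=1 and feature-id=3 were selected.

I'd be grateful if someone could confirm or deny this, though.

Thanks in advance.

Contents of contrived_svmlight_train_file.txt: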

1 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1 1:1.000000 2:1.000000#mB
0 5:1.000000#mC
1 1:1.000000 2:1.000000#mD
0 3:1.000000 4:1.000000#mE
0 3:1.000000#mF
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
0 2:1.000000#mH

PS: Apologies for not formatting this properly (first time here); I hope it is legible and comprehensible!

Answer 1

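"Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values, since (see below) the feature-id values in my input file start from 1."

Actually, they do. The SVMlight format loader detects that your input file uses one-based indices and subtracts one from every index so as not to waste a column. The values returned by get_support(indices=True) are therefore zero-based column indices, and [0, 2] corresponds to feature-id=1 and feature-id=3. If that's not what you want, pass zero_based=True to load_svmlight_file to pretend that the file is actually zero-based, which inserts an extra (empty) column; see its documentation for details.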
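To make the index shift concrete, here is a minimal sketch, assuming a reasonably recent scikit-learn on Python 3 rather than the 0.11 release from the question. It loads the contrived file from an in-memory buffer (the names SVMLIGHT_DATA, X_auto, X_padded, column_indices and feature_ids are mine, purely for illustration) and converts the selected column indices back to the one-based feature-ids used in the file.

```python
import io

from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

# The contrived training file from the question, held in memory as bytes.
SVMLIGHT_DATA = b"""\
1 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1 1:1.000000 2:1.000000#mB
0 5:1.000000#mC
1 1:1.000000 2:1.000000#mD
0 3:1.000000 4:1.000000#mE
0 3:1.000000#mF
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
0 2:1.000000#mH
"""

# Default behaviour (zero_based="auto"): the ids start at 1, so the loader
# treats them as one-based and shifts them down. Column 0 holds feature-id 1.
X_auto, y = load_svmlight_file(io.BytesIO(SVMLIGHT_DATA))

# zero_based=True: ids are used as column numbers unchanged, which leaves
# column 0 as an extra, all-zero column.
X_padded, _ = load_svmlight_file(io.BytesIO(SVMLIGHT_DATA), zero_based=True)

selector = SelectKBest(score_func=chi2, k=2).fit(X_auto, y)
column_indices = selector.get_support(indices=True)

# For a one-based file loaded with "auto": feature-id = column index + 1.
# Features 3 and 5 look tied on chi2 for this data, so which of the two is
# picked second may differ between scikit-learn versions.
feature_ids = [int(i) + 1 for i in column_indices]

print(X_auto.shape, X_padded.shape)
print(list(column_indices), feature_ids)
```

If you would rather have the column indices equal the feature-ids directly, fitting the selector on the zero_based=True matrix should work too, at the cost of carrying the empty column 0 around.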

Solution 1
2 Accepted 2012-10-10 23:49:35