
Feature Selection for Supervised Learning

import numpy as np
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_classif

I have 3 labels (male, female, na), denoted as follows:

labels = [0,1,2]

Each label is described by 3 features (height, weight, and age) in the training data:

Training data for males:

male_height = np.array([111,121,137,143,157])
male_weight = np.array([60,70,88,99,75])
male_age = np.array([41,32,73,54,35])

males = np.vstack([male_height,male_weight,male_age]).T

Training data for females:

female_height = np.array([91,121,135,98,90])
female_weight = np.array([32,67,98,86,56])
female_age = np.array([51,35,33,67,61])

females = np.vstack([female_height,female_weight,female_age]).T

Training data for 'not available' (na):

na_height = np.array([96,127,145,99,91])
na_weight = np.array([42,97,78,76,86])
na_age = np.array([56,35,49,64,66])

nas = np.vstack([na_height,na_weight,na_age]).T

So, the complete training data are:

trainingData = np.vstack([males,females,nas])

Complete labels are:

labels = np.repeat(labels, 5)

Now, I want to select the best features, output their names, and fit the support vector machine model using only those best features.

I tried the following, based on the answer from @eickenberg and the comments from @larsmans:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

keep = 2
selector = SelectKBest(f_classif, k=keep)
clf = make_pipeline(selector, StandardScaler(), svm.SVC())
clf.fit(trainingData, labels)

selected = trainingData[selector.get_support()]

print(selected)

[[111 60 41]
 [121 70 32]]

However, all the selected elements belong to the label 'male', with the features height, weight, and age respectively. I cannot figure out where I am messing up. Could someone point me in the right direction?

You can use e.g. SelectKBest as follows:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import svm

keep = 2
selector = SelectKBest(f_classif, k=keep)

and place it into your pipeline

pipe = make_pipeline(selector, StandardScaler(), svm.SVC())

pipe.fit(trainingData, labels)
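
To then output the names of the selected features, note that get_support() returns one boolean per feature (column), so it must index columns, not rows. A minimal sketch (the feature_names array is my own, added for illustration; sklearn pipelines fit their step objects in place, so selector is fitted after pipe.fit):

feature_names = np.array(['height', 'weight', 'age'])  # illustrative names
mask = selector.get_support()       # boolean mask over the 3 feature columns
print(feature_names[mask])          # names of the k best features
selected = trainingData[:, mask]    # select columns, not rows
print(selected)

This is also the bug in the question: trainingData[selector.get_support()] indexes rows with the feature mask, which is why the first two (male) rows came back.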

To be honest, I have used the Support Vector Machine model on text classification (which is an entirely different problem). But from that experience, I can confidently say that the more features you have, the better your predictions tend to be.

To summarize, do not filter out features: the Support Vector Machine will make use of them no matter how little importance they carry.

But if this is a real necessity, look into scikit-learn's RandomForestClassifier. It can assess which features are more important via its feature_importances_ attribute.

Here's an example of how I would use it (code not tested):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()  # tweak the parameters yourself
clf.fit(X, Y)  # if you're passing in a sparse matrix, apply .toarray() to X
print(clf.feature_importances_)
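
To map those importances back to feature names, one option is to sort them; a small sketch (feature_names is again illustrative, matching the 3 features above):

import numpy as np

feature_names = ['height', 'weight', 'age']         # illustrative names
order = np.argsort(clf.feature_importances_)[::-1]  # most important first
for i in order:
    print(feature_names[i], clf.feature_importances_[i])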

Hope that helps.
