
How to interpret importance of features from coef_ outputs for multi-class in sklearn.feature_selection?

I have a dataset of 150 samples and almost 10000 features. I have clustered the samples into 6 clusters. I have used the sklearn.feature_selection.RFECV method to reduce the number of features. The method estimates the number of important features as 3000 with ~95% accuracy using 10-fold CV. However, I can get ~92% accuracy using around 250 features (I plotted this using grid_scores_). Therefore, I would like to get those 250 features.
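(A sketch of one way to do this, on synthetic stand-in data rather than the author's dataset: once RFECV has shown that ~250 features are enough, a plain sklearn.feature_selection.RFE with n_features_to_select=250 will return exactly that subset, since RFECV itself only keeps the CV-optimal number.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Toy stand-in for the real 150 x 10000 dataset with 6 clusters.
X, y = make_classification(n_samples=150, n_features=500, n_informative=20,
                           n_classes=6, random_state=0)

# RFE (not RFECV) lets you request a fixed number of surviving features.
# step=0.1 drops 10% of the remaining features per elimination round.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=250, step=0.1)
rfe.fit(X, y)

selected = np.where(rfe.support_)[0]   # indices of the 250 kept features
```

The support_ mask (and ranking_) then identifies which columns of the original matrix survived.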

I have checked the question Getting features in RFECV scikit-learn and found out that the importances of the selected features can be calculated by:

np.absolute(rfecv.estimator_.coef_)

which, for binary classification, returns an array whose length equals the number of important features. As I indicated before, I have 6 clusters, and sklearn.feature_selection.RFECV does 1-vs-1 classification, so I get a (15, 3000) ndarray. I do not know how to proceed. I was thinking of taking the dot product for each feature, like this:

cofs = rfecv.estimator_.coef_

coeffs = []
for x in range(cofs.shape[1]):
    vec = cofs[:, x]
    weight = vec.transpose() @ vec
    coeffs.append(weight)

And I get an array of 3000 values, one per feature. I can sort these and get the results I want. But I am not sure whether it is right and makes sense. I would really appreciate any other solutions.
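(For what it's worth, the loop above computes the squared L2 norm of each feature's column of coefficients, which NumPy can do in one vectorized expression; a small self-contained check on random data of the same shape:)

```python
import numpy as np

# Hypothetical coefficient matrix: 15 one-vs-one classifiers x 3000 features,
# matching the (15, 3000) shape from the question.
rng = np.random.default_rng(0)
cofs = rng.normal(size=(15, 3000))

# Loop version from the question: per-feature dot product of the column
# with itself, i.e. the sum of squared coefficients for that feature.
coeffs = [cofs[:, x] @ cofs[:, x] for x in range(cofs.shape[1])]

# Equivalent vectorized form.
weights = (cofs ** 2).sum(axis=0)

assert np.allclose(coeffs, weights)
```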

Well, I delved into the source code. Here is what I found; they are actually doing pretty much the same thing:

# Get ranks
if coefs.ndim > 1:
    ranks = np.argsort(safe_sqr(coefs).sum(axis=0))
else:
    ranks = np.argsort(safe_sqr(coefs))

If it is a multi-class problem, they sum up the squared coefficients (safe_sqr) across the classifiers. Hope that helps others.
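(So the same ranking can be reproduced outside RFE. A minimal sketch on random stand-in data, using plain squaring in place of sklearn's safe_sqr, which is equivalent for a dense float array: score each feature by its summed squared coefficient and take the 250 strongest.)

```python
import numpy as np

# Hypothetical one-vs-one coef_ matrix: 15 classifiers x 3000 features.
rng = np.random.default_rng(1)
coefs = rng.normal(size=(15, 3000))

# Same quantity RFE ranks by: safe_sqr(coefs).sum(axis=0).
importance = (coefs ** 2).sum(axis=0)

# Indices of the 250 highest-scoring features, strongest first.
top250 = np.argsort(importance)[::-1][:250]
```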



 