
How to interpret importance of features from _coeffs outputs for multi-class in sklearn.feature_selection?

I have a dataset of 150 samples and almost 10,000 features. I have clustered the samples into 6 clusters and used sklearn.feature_selection.RFECV to reduce the number of features. The method estimates that about 3000 features are important, giving ~95% accuracy with 10-fold CV. However, I can get ~92% accuracy using only around 250 features (I plotted this using grid_scores_). Therefore, I would like to extract those 250 features.

I have checked the question Getting features in RFECV scikit-learn and found that the importances of the selected features can be calculated by:

np.absolute(rfecv.estimator_.coef_)
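Note that rfecv.estimator_.coef_ is expressed in the reduced feature space, so to relate those importances back to the original columns you can use the boolean rfecv.support_ mask. A numpy-only sketch (the mask and coefficients here are made up for illustration; in practice they come from the fitted RFECV object):

```python
import numpy as np

n_original = 10                        # stand-in for the ~10000 original features
support_ = np.zeros(n_original, dtype=bool)
support_[[1, 4, 7]] = True             # pretend RFECV kept 3 features

coef_ = np.array([[0.5, -2.0, 0.1]])   # binary case: shape (1, n_selected)
importances = np.absolute(coef_).ravel()

# Map the per-selected-feature importances back onto original column indices
original_idx = np.flatnonzero(support_)
```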

which returns an array whose length equals the number of selected features for binary classification. As I indicated before, I have 6 clusters, and sklearn.feature_selection.RFECV does 1-vs-1 classification, so I get a (15, 3000) ndarray. I do not know how to proceed. I was thinking of taking the dot product for each feature, like this:

import numpy as np

cofs = rfecv.estimator_.coef_        # shape (15, 3000): one row per pairwise classifier

coeffs = []
for x in range(cofs.shape[1]):
    vec = cofs[:, x]                 # coefficients of feature x across all 15 classifiers
    weight = vec.transpose() @ vec   # squared L2 norm of that coefficient vector
    coeffs.append(weight)

This gives me an array of length 3000. I can sort it and get the result I want, but I am not sure whether this is right and makes sense. I would really appreciate any other solutions.
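For what it's worth, the loop above is equivalent to a one-line vectorized computation. A numpy-only sketch, using random coefficients as a stand-in for rfecv.estimator_.coef_:

```python
import numpy as np

rng = np.random.default_rng(0)
cofs = rng.normal(size=(15, 3000))   # stand-in for rfecv.estimator_.coef_

# Loop version: squared L2 norm of each feature's coefficient vector
coeffs = np.array([cofs[:, x] @ cofs[:, x] for x in range(cofs.shape[1])])

# Vectorized equivalent: sum of squares down each column
coeffs_vec = (cofs ** 2).sum(axis=0)
```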

Well, I delved into the source code. Here is what I found; they are actually doing pretty much the same thing:

# Get ranks
if coefs.ndim > 1:
    ranks = np.argsort(safe_sqr(coefs).sum(axis=0))
else:
    ranks = np.argsort(safe_sqr(coefs))

For a multi-class problem, they sum the squared coefficients across the classifiers before ranking. Hope that helps others.
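To actually pull out the 250 strongest features with that ranking, here is a numpy-only sketch, with plain squaring standing in for sklearn's safe_sqr helper and random coefficients standing in for the fitted ones:

```python
import numpy as np

def safe_sqr(X):
    # numpy stand-in for sklearn.utils.safe_sqr (element-wise square)
    return X ** 2

rng = np.random.default_rng(0)
coefs = rng.normal(size=(15, 3000))       # one row per pairwise classifier

scores = safe_sqr(coefs).sum(axis=0)      # per-feature importance, length 3000
top_250 = np.argsort(scores)[::-1][:250]  # indices of the 250 strongest features
```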
