
Determine what features to drop / select using GridSearch in scikit-learn

How does one determine what features/columns/attributes to drop using GridSearch results?

In other words, if GridSearch returns that max_features should be 3, can we determine which EXACT 3 features should be used?

Let's take the classic Iris data set with 4 features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold 
from sklearn.model_selection import GridSearchCV
from sklearn import datasets

iris = datasets.load_iris()
all_inputs = iris.data
all_labels = iris.target

decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(all_inputs, all_labels)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Let's say we get that max_features is 3. How do I find out which 3 features were the most appropriate here?

Putting in max_features = 3 will work for fitting, but I want to know which attributes were the right ones.

Do I have to generate the possible list of all feature combinations myself to feed GridSearch, or is there an easier way?

max_features is a hyperparameter of your decision tree. It does not drop any of your features before training, nor does it find good or bad features.

Your decision tree looks at all features to find the best feature to split your data on, based on your labels. If you set max_features to 3 as in your example, your decision tree just looks at three random features at each split and takes the best of those to make the split. This makes training faster and adds some randomness to your classifier (which might also help against overfitting). A quick way to convince yourself of this is shown in the sketch below.
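As a minimal sketch (mine, not part of the original answer), you can fit one tree with default settings and one with max_features=3 on the iris data from the question and compare their impurity-based importances; the restricted tree can still end up using all four features:

# max_features only limits how many randomly chosen features are considered
# at each split; it does not drop columns from the data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

tree_all = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_sub = DecisionTreeClassifier(max_features=3, random_state=0).fit(X, y)

# Importances per feature name for both trees
print(dict(zip(iris.feature_names, tree_all.feature_importances_)))
print(dict(zip(iris.feature_names, tree_sub.feature_importances_)))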

Your classifier decides which feature to split on using a criterion (like the Gini index or information gain). So you can either use such a measure for feature importance, or use an estimator that has the attribute feature_importances_, as @gorjan mentioned.

If you use an estimator that has the attribute feature_importances_, you can simply do:

feature_importances = grid_search.best_estimator_.feature_importances_

This will return an array of length n_features indicating how important each feature was for the best estimator found by the grid search. Additionally, if you want to use, let's say, a linear classifier (logistic regression) that doesn't have the attribute feature_importances_, what you could do is:

# Get the best estimator's coefficients
estimator_coeff = grid_search.best_estimator_.coef_
# Scale the coefficients by the standard deviation of each feature
coeff_magnitude = np.std(all_inputs, 0) * estimator_coeff

which is also an indication of feature importance. If a model's coefficient is >> 0 or << 0, that means, in layman's terms, that the model is trying hard to capture the signal present in that feature.
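To get back to the original question of which EXACT features to keep, one possible follow-up (my sketch, not part of the answer above, reusing grid_search, all_inputs and iris from the question) is to rank the features by importance and keep the top max_features of them:

import numpy as np

# Importances of the best tree found by the grid search
importances = grid_search.best_estimator_.feature_importances_
best_k = grid_search.best_params_['max_features']  # e.g. 3

# Indices of the best_k most important features, highest first
top_idx = np.argsort(importances)[::-1][:best_k]
print('Selected features:', [iris.feature_names[i] for i in top_idx])

# Optionally retrain on just those columns
reduced_inputs = all_inputs[:, top_idx]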
