
Feature Importance using Imbalanced-learn library

The imblearn library is used for imbalanced classification. It lets you use scikit-learn estimators while balancing the classes with a variety of methods, from undersampling to oversampling to ensembles.

My question, however, is: how can I get the feature importance of the estimator after using BalancedBaggingClassifier or any other sampling method from imblearn?

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Imbalanced two-class toy dataset (roughly 10% / 90%)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion must be an explicit string ('gini' or 'entropy'); the original
# snippet referenced an undefined variable criteria_
bbc = BalancedBaggingClassifier(random_state=42,
                                base_estimator=DecisionTreeClassifier(criterion='gini',
                                                                      max_features='sqrt',
                                                                      random_state=1),
                                n_estimators=2000)
bbc.fit(X_train, y_train)

Not all estimators in sklearn allow you to get feature importances (for example, BaggingClassifier doesn't). If the estimator does, it looks like it should just be stored as estimator.feature_importances_, since the imblearn package subclasses from sklearn classes. I don't know which estimators imblearn has implemented, so I don't know whether any of them provide feature_importances_, but in general you should look at the sklearn documentation for the corresponding object to see whether it does.
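
As a quick sanity check, you can test for the attribute at runtime before relying on it; a minimal sketch, assuming the bbc ensemble from the question has already been fitted:

# BalancedBaggingClassifier (like sklearn's BaggingClassifier) does not expose
# feature_importances_ itself, so fall back to inspecting the bagged estimators.
if hasattr(bbc, "feature_importances_"):
    print(bbc.feature_importances_)
else:
    print("No feature_importances_ on the ensemble; inspect bbc.estimators_ instead.")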

You can, in this case, look at the feature importances for each of the estimators within the BalancedBaggingClassifier, like this:

# Each element of bbc.estimators_ is a Pipeline of (RandomUnderSampler, DecisionTreeClassifier),
# so steps[1][1] is the fitted tree inside each bagged pipeline.
for estimator in bbc.estimators_:
    print(estimator.steps[1][1].feature_importances_)

And you can print the mean importance across the estimators like this:

import numpy as np

print(np.mean([est.steps[1][1].feature_importances_ for est in bbc.estimators_], axis=0))
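
If you want a ranked view rather than a raw array, you can sort the averaged importances by index; a small sketch (the feature "names" here are just the column indices produced by make_classification):

import numpy as np

# Average the per-tree importances across the bagged pipelines, then rank them.
mean_importances = np.mean(
    [est.steps[1][1].feature_importances_ for est in bbc.estimators_], axis=0)
for idx in np.argsort(mean_importances)[::-1][:5]:
    print('feature {}: {:.4f}'.format(idx, mean_importances[idx]))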

There is a way around this, though it is not very efficient. The BalancedBaggingClassifier applies RandomUnderSampler repeatedly and fits the estimator on top of each sample. A for-loop with RandomUnderSampler is one way to bypass the pipeline approach and call the scikit-learn estimator directly. This also lets you look at feature_importances_:

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

my_list = []
for i in range(0, 10):  # random under-sampling 10 times
    # use a different seed each iteration so the draws actually differ;
    # sample() was replaced by fit_resample() in newer imblearn versions
    rus = RandomUnderSampler(random_state=i)
    X_pl, y_pl = rus.fit_resample(X_train, y_train)
    my_list.append((X_pl, y_pl))  # forming tuples from samples

X_pl = []
Y_pl = []
for num in range(0, len(my_list)):  # creating the DataFrames for input/output
    X_pl.append(pd.DataFrame(my_list[num][0]))
    Y_pl.append(pd.DataFrame(my_list[num][1]))

X_pl_ = pd.concat(X_pl)  # concatenating the DataFrames
Y_pl_ = pd.concat(Y_pl)

# max_features cannot exceed the number of features (20 here), so use 'sqrt'
RF = RandomForestClassifier(n_estimators=2000, criterion='gini', max_features='sqrt', random_state=1)
RF.fit(X_pl_, Y_pl_.values.ravel())
RF.feature_importances_
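
The pandas detour isn't strictly necessary; the resampled arrays can be stacked with NumPy directly. A minimal sketch under the same assumptions (X_train and y_train as above, fit_resample being the current imblearn method name):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

X_parts, y_parts = [], []
for seed in range(10):  # ten independent under-sampled draws
    X_res, y_res = RandomUnderSampler(random_state=seed).fit_resample(X_train, y_train)
    X_parts.append(X_res)
    y_parts.append(y_res)

X_stacked = np.vstack(X_parts)       # shape: (total_resampled_rows, 20)
y_stacked = np.concatenate(y_parts)
# X_stacked / y_stacked can then be passed to RandomForestClassifier.fit as above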

According to the scikit-learn documentation, you can compute an impurity-based feature importance for classifiers that don't provide their own by fitting some sort of forest classifier on the same data. Here my classifier doesn't have feature_importances_, so I'm adding it directly:

from sklearn.ensemble import ExtraTreesClassifier

classifier.fit(x_train, y_train)

...
...

# Fit a surrogate forest with the same settings and borrow its
# impurity-based importances for the original classifier.
forest = ExtraTreesClassifier(n_estimators=classifier.n_estimators,
                              random_state=classifier.random_state)

forest.fit(x_train, y_train)
classifier.feature_importances_ = forest.feature_importances_
