How to calculate feature importance in each model of cross validation in sklearn
I am using RandomForestClassifier() with 10-fold cross validation as follows.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring='accuracy')
print(accuracy.mean())
I want to identify the important features in my feature space. It seems straightforward to get the feature importance for a single classifier, as follows.
print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
However, I could not find how to compute feature importance for cross validation in sklearn.
In summary, I want to identify the most effective features (e.g., by using an average importance score) across the 10 folds of cross validation.
I am happy to provide more details if needed.
cross_val_score() does not return the estimator fitted on each train-test fold. You need to use cross_validate() and set return_estimator=True.
Here is a working example:
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# load_diabetes is a regression dataset, but its integer-valued targets
# are treated as class labels here, so the classifier fits without error.
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
output = cross_validate(clf, X, y, cv=2, scoring='accuracy', return_estimator=True)
for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_importances = pd.DataFrame(estimator.feature_importances_,
                                       index=diabetes.feature_names,
                                       columns=['importance']).sort_values('importance', ascending=False)
    print(feature_importances)
Output:
Features sorted by their score for estimator 0:
importance
s6 0.137735
age 0.130152
s5 0.114561
s2 0.113683
s3 0.112952
bmi 0.111057
bp 0.108682
s1 0.090763
s4 0.056805
sex 0.023609
Features sorted by their score for estimator 1:
importance
age 0.129671
bmi 0.125706
s2 0.125304
s1 0.113903
bp 0.111979
s6 0.110505
s5 0.106099
s3 0.098392
s4 0.054542
sex 0.023900
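Since the question asks for a single ranking via an average importance score, one way to get it is to stack the per-fold importances and average them. A minimal sketch using the same setup as the example above (the averaging step itself is an addition, not part of the original answer):

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

clf = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")
output = cross_validate(clf, X, y, cv=2, scoring='accuracy', return_estimator=True)

# One row per fold, one column per feature
importances = pd.DataFrame(
    [est.feature_importances_ for est in output['estimator']],
    columns=diabetes.feature_names)

# Average over folds and rank features by mean importance
mean_importance = importances.mean(axis=0).sort_values(ascending=False)
print(mean_importance)
```

Since each estimator's feature_importances_ sums to 1, the averaged scores also sum to 1 and remain directly comparable across features.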