How to get feature importance in xgboost?
I'm using xgboost to build a model and am trying to find the importance of each feature using get_fscore(), but it returns {}.
My training code is:
dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)
So is there any mistake in my training code? How do I get feature importance in xgboost?
In your code you can get the importance of each feature in dict form:
bst.get_score(importance_type='gain')
>>{'ftr_col1': 77.21064539577829,
'ftr_col2': 10.28690566363971,
'ftr_col3': 24.225014841466294,
'ftr_col4': 11.234086283060112}
Explanation: xgb.train() returns a Booster, and its get_score() method is defined as:
get_score(fmap='', importance_type='weight')
https://xgboost.readthedocs.io/en/latest/python/python_api.html
Get the table containing the scores and feature names, and then plot it.
import pandas as pd
feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))  # plot the top 40 features
Using the sklearn API and XGBoost >= 0.81:
clf.get_booster().get_score(importance_type="gain")
or
regr.get_booster().get_score(importance_type="gain")
For this to work correctly, X must be a pandas.DataFrame when you call regr.fit (or clf.fit); otherwise the scores are keyed by generic names such as f0, f1 instead of your column names.
Build the model with XGBoost first:
from xgboost import XGBClassifier, plot_importance
model = XGBClassifier()
model.fit(train, label)
model.feature_importances_ is then an array, so we can sort its indices in descending order of importance:
import numpy as np
sorted_idx = np.argsort(model.feature_importances_)[::-1]
Then print the sorted importances together with the column names (I assume the data was loaded with pandas):
for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]])
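A more compact variant of the loop above is to put the importances into a pandas Series indexed by column name and sort that. The sketch below uses stand-in values rather than a fitted model, so the names and numbers are made up:

```python
import numpy as np
import pandas as pd

# Stand-ins for train.columns and model.feature_importances_ (made-up values).
columns = pd.Index(["ftr_col1", "ftr_col2", "ftr_col3", "ftr_col4"])
importances = np.array([0.63, 0.08, 0.20, 0.09])

# One labelled, sorted object instead of an index array plus a loop.
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)

# The ordering matches the argsort approach.
sorted_idx = np.argsort(importances)[::-1]
print(list(columns[sorted_idx]))
```

With a real model, the two stand-ins would be replaced by train.columns and model.feature_importances_.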
Furthermore, we can plot the importances with XGBoost's built-in function:
from matplotlib import pyplot
plot_importance(model, max_num_features=15)
pyplot.show()
Use max_num_features in plot_importance to limit the number of features if you want.
For feature importance, try this:
Classification:
pd.DataFrame(list(bst.get_fscore().items()), columns=['feature', 'importance']).sort_values('importance', ascending=False)
Regression:
xgb.plot_importance(bst)
For anyone who comes across this issue while using xgb.XGBRegressor(): the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert it to a DMatrix. Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.
fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)
After fitting the regressor, fit.feature_importances_ returns an array of weights, which I'm assuming is in the same order as the feature columns of the pandas DataFrame.
My current setup is Ubuntu 16.04, Anaconda distro, Python 3.6, xgboost 0.6, and sklearn 0.18.1.
I don't know exactly how to get the values, but there is a good way to plot feature importance:
import matplotlib.pyplot as plt
model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12, 18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
According to this post, there are 3 different ways to get feature importance from XGBoost: built-in feature importance, permutation-based importance, and SHAP-based importance.
Code example:
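# Built-in feature importance
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
# X_train, y_train and boston come from the Boston housing data described below
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
sorted_idx = xgb.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")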
Please be aware of which type of feature importance you are using. There are several types of importance; see the docs. The scikit-learn-like API of XGBoost returns gain importance, while get_fscore returns the weight type.
# Permutation-based importance
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(xgb, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
This is my preferred way to compute the importance. However, it can fail in the case of highly collinear features, so be careful! It uses permutation_importance from scikit-learn.
# SHAP-based importance
import shap
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
To use the above code, you need to have the shap package installed.
I ran the example analysis on the Boston data (house price regression from scikit-learn) and plotted all 3 feature importances.
All three plots are for the same model, yet the results differ. I prefer permutation-based importance because it gives me a clear picture of which features impact the performance of the model (if there is no high collinearity).
Try this (newer XGBoost releases name the method get_booster() instead of booster()):
fscore = clf.best_estimator_.get_booster().get_fscore()
In case you are using XGBRegressor, try: model.get_booster().get_score().
That returns results that you can directly visualize with the plot_importance command.
None of the above worked for me. This is the code I ended up with to sort features by importance:
from collections import Counter
Counter({k: v for k, v in sorted(model.get_fscore().items(), key=lambda item: item[1], reverse=True)}).most_common()
Just replace model with the name of your model and everything will be there. Of course I'm doing the same thing twice; there's no need to sort a dict before passing it to Counter, but I figure it wouldn't hurt to leave it there in case anyone hates Counters.
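For reference, either half of that one-liner is enough on its own. The sketch below needs only the standard library and uses the score dict from the example output earlier in this thread in place of model.get_fscore():

```python
from collections import Counter

# Same shape as the dict returned by get_fscore() / get_score();
# values copied from the example output earlier in the thread.
scores = {
    "ftr_col1": 77.21064539577829,
    "ftr_col2": 10.28690566363971,
    "ftr_col3": 24.225014841466294,
    "ftr_col4": 11.234086283060112,
}

# Option 1: plain sorted() on the items, highest score first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.2f}")

# Option 2: Counter.most_common(n) when only the top n are needed.
top2 = Counter(scores).most_common(2)
print(top2)
```

Note that most_common must be called; without the parentheses the expression evaluates to the bound method rather than a list.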
import matplotlib.pyplot as plt
print(model.feature_importances_)  # one importance value per feature, in column order
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()