
How to get feature importance in xgboost?

I'm using xgboost to build a model, and I'm trying to find the importance of each feature using get_fscore(), but it returns {}.

My training code is:

dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)

So is there any mistake in my training code? How can I get feature importance in xgboost?

In your code you can get the feature importance for each feature in dict form:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: The train() API's method get_score() is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str (optional)) – The name of the feature map file.
  • importance_type
    • 'weight' - the number of times a feature is used to split the data across all trees.
    • 'gain' - the average gain across all splits the feature is used in.
    • 'cover' - the average coverage across all splits the feature is used in.
    • 'total_gain' - the total gain across all splits the feature is used in.
    • 'total_cover' - the total coverage across all splits the feature is used in.

https://xgboost.readthedocs.io/en/latest/python/python_api.html
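As a quick illustration, here is a minimal sketch (assuming the bst booster trained in the question above) that queries each of the importance types listed:

# Minimal sketch: print every importance type for the booster `bst` trained above.
for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print(imp_type, bst.get_score(importance_type=imp_type))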

Get the table containing scores and feature names, and then plot it.

import pandas as pd  # needed for the DataFrame below

feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))  # plot the top 40 features

For example:

(example plot: horizontal bar chart of the top features by score)

Using the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
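A minimal sketch of that pattern (the toy column names and data here are hypothetical, just to show that fitting on a DataFrame lets the booster keep real feature names):

import pandas as pd
from xgboost import XGBRegressor

# Hypothetical toy data; fitting on a DataFrame preserves the column names.
X = pd.DataFrame({'age': [25, 32, 47, 51, 62, 23], 'income': [40, 60, 80, 120, 90, 35]})
y = [1.0, 1.5, 2.2, 3.1, 2.8, 0.9]

regr = XGBRegressor(n_estimators=10)
regr.fit(X, y)
print(regr.get_booster().get_score(importance_type="gain"))  # keys are 'age' and 'income'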

Build the model with XGBoost first:

from xgboost import XGBClassifier, plot_importance
model = XGBClassifier()
model.fit(train, label)

The feature_importances_ attribute is an array, so we can sort it in descending order:

sorted_idx = np.argsort(model.feature_importances_)[::-1]

Then, it is time to print all the sorted importances and the column names together as lists (I assume the data was loaded with Pandas):

for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]]) 

Furthermore, we can plot the importances with XGBoost's built-in function:

from matplotlib import pyplot  # needed for pyplot.show()

plot_importance(model, max_num_features=15)
pyplot.show()

Use max_num_features in plot_importance to limit the number of features if you want.

For feature importance, try this:

Classification:

pd.DataFrame(bst.get_fscore().items(), columns=['feature','importance']).sort_values('importance', ascending=False)

Regression:

xgb.plot_importance(bst)

For anyone who comes across this issue while using xgb.XGBRegressor(), the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert the data to dmatrix(). Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.

fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)

After fitting the regressor, fit.feature_importances_ returns an array of weights which, I'm assuming, is in the same order as the feature columns of the pandas dataframe.

My current setup is Ubuntu 16.04, Anaconda distro, Python 3.6, xgboost 0.6, and sklearn 18.1.

I don't know how to get the values for certain, but there is a good way to plot feature importance:

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()

According to this post, there are 3 different ways to get feature importance from XGBoost:

  • use built-in feature importance,
  • use permutation-based importance,
  • use SHAP-based importance.

Built-in feature importance

Code example:

from xgboost import XGBRegressor
import matplotlib.pyplot as plt

xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
sorted_idx = xgb.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")

Please be aware of what type of feature importance you are using. There are several types of importance; see the docs. The scikit-learn-like API of XGBoost returns gain importance, while get_fscore returns the weight type.
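For instance, a minimal sketch using the xgb regressor fitted above, asking the underlying booster for both types explicitly:

# Sketch: compare the two importance types on the same fitted model.
booster = xgb.get_booster()
print(booster.get_score(importance_type='weight'))  # split counts, the same thing get_fscore() returns
print(booster.get_score(importance_type='gain'))    # average gain per split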

Permutation based importance

from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(xgb, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")

This is my preferred way to compute the importance. However, it can fail in the case of highly collinear features, so be careful! It uses permutation_importance from scikit-learn.

SHAP based importance

import shap

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

To use the above code, you need to have the shap package installed.

I was running the example analysis on the Boston data (house-price regression from scikit-learn). Below are the 3 feature importance plots:

Built-in importance

(plot: built-in xgboost importance)

Permutation based importance

(plot: permutation importance)

SHAP importance

(plot: SHAP summary importance)

All plots are for the same model, and as you can see, there is a difference in the results. I prefer permutation-based importance because I have a clear picture of which features impact the performance of the model (provided there is no high collinearity).

Try this:

fscore = clf.best_estimator_.booster().get_fscore()

In case you are using XGBRegressor, try with: model.get_booster().get_score().

That returns results that you can directly visualize through the plot_importance command.
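For example, a minimal sketch (assuming model is the fitted XGBRegressor and matplotlib is installed); plot_importance also accepts a plain dict of scores:

import matplotlib.pyplot as plt
import xgboost as xgb

scores = model.get_booster().get_score(importance_type='weight')  # dict: feature name -> score
xgb.plot_importance(scores)  # plot_importance accepts a Booster, an XGBModel, or a dict of scores
plt.show()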

None of the above worked for me. This is the code I ended up with to sort features by importance:

from collections import Counter

# Sort the importance dict by value (descending) and list it as (feature, score) pairs.
Counter({k: v for k, v in sorted(model.get_fscore().items(), key=lambda item: item[1], reverse=True)}).most_common()

Just replace model with the name of your model and everything will be there. Of course I'm doing the same thing twice; there's no need to sort the dict before passing it to Counter, but I figured it wouldn't hurt to leave it there in case anyone hates Counters.

print(model.feature_importances_)

plt.bar(range(len(model.feature_importances_)), model.feature_importances_)

