
How to get feature importance in xgboost?

I'm using xgboost to build a model, and I'm trying to find the importance of each feature using get_fscore(), but it returns {}.

My training code is:

dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)

So is there any mistake in my training code? How can I get feature importance in xgboost?

In your code you can get the feature importance for each feature in dict form:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: The train() API's method get_score() is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str (optional)) – The name of the feature map file.
  • importance_type
    • 'weight' - the number of times a feature is used to split the data across all trees.
    • 'gain' - the average gain across all splits the feature is used in.
    • 'cover' - the average coverage across all splits the feature is used in.
    • 'total_gain' - the total gain across all splits the feature is used in.
    • 'total_cover' - the total coverage across all splits the feature is used in.

https://xgboost.readthedocs.io/en/latest/python/python_api.html
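As a quick illustration, here is a minimal sketch (assuming the bst booster trained in the question above) that queries each of the importance types listed:

# Minimal sketch: print every importance type for the booster `bst` trained above.
for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print(imp_type, bst.get_score(importance_type=imp_type))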

Get the table containing scores and feature names, and then plot it.

import pandas as pd  # needed for the DataFrame below

feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))  # plot the top 40 features

For example:

(example plot: horizontal bar chart of the top features by score)

Using the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
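A minimal sketch of that pattern (the toy column names and data here are hypothetical, just to show that fitting on a DataFrame lets the booster keep real feature names):

import pandas as pd
from xgboost import XGBRegressor

# Hypothetical toy data; fitting on a DataFrame preserves the column names.
X = pd.DataFrame({'age': [25, 32, 47, 51, 62, 23], 'income': [40, 60, 80, 120, 90, 35]})
y = [1.0, 1.5, 2.2, 3.1, 2.8, 0.9]

regr = XGBRegressor(n_estimators=10)
regr.fit(X, y)
print(regr.get_booster().get_score(importance_type="gain"))  # keys are 'age' and 'income'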

Build the model with XGBoost first:

from xgboost import XGBClassifier, plot_importance
model = XGBClassifier()
model.fit(train, label)

The feature_importances_ attribute is an array, so we can sort it in descending order:

sorted_idx = np.argsort(model.feature_importances_)[::-1]

Then, it is time to print all the sorted importances and the column names together as lists (I assume the data was loaded with Pandas):

for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]]) 

Furthermore, we can plot the importances with XGBoost's built-in function:

from matplotlib import pyplot  # needed for pyplot.show()

plot_importance(model, max_num_features=15)
pyplot.show()

Use max_num_features in plot_importance to limit the number of features if you want.

For feature importance, try this:

Classification:

pd.DataFrame(bst.get_fscore().items(), columns=['feature','importance']).sort_values('importance', ascending=False)

Regression:

xgb.plot_importance(bst)

For anyone who comes across this issue while using xgb.XGBRegressor(), the workaround I'm using is to keep the data in a pandas.DataFrame() or numpy.array() and not to convert the data to dmatrix(). Also, I had to make sure the gamma parameter is not specified for the XGBRegressor.

fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)

After fitting the regressor, fit.feature_importances_ returns an array of weights which, I'm assuming, is in the same order as the feature columns of the pandas dataframe.

My current setup is Ubuntu 16.04, Anaconda distro, Python 3.6, xgboost 0.6, and sklearn 18.1.

I don't know how to get the values for certain, but there is a good way to plot feature importance:

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()

According to this post, there are 3 different ways to get feature importance from XGBoost:

  • use built-in feature importance,
  • use permutation-based importance,
  • use SHAP-based importance.

Built-in feature importance

Code example:

from xgboost import XGBRegressor
import matplotlib.pyplot as plt

xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
sorted_idx = xgb.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")

Please be aware of what type of feature importance you are using. There are several types of importance; see the docs. The scikit-learn-like API of XGBoost returns gain importance, while get_fscore returns the weight type.
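For instance, a minimal sketch using the xgb regressor fitted above, asking the underlying booster for both types explicitly:

# Sketch: compare the two importance types on the same fitted model.
booster = xgb.get_booster()
print(booster.get_score(importance_type='weight'))  # split counts, the same thing get_fscore() returns
print(booster.get_score(importance_type='gain'))    # average gain per split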

Permutation based importance

from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(xgb, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")

This is my preferred way to compute the importance. However, it can fail in the case of highly collinear features, so be careful! It uses permutation_importance from scikit-learn.

SHAP based importance

import shap

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

To use the above code, you need to have the shap package installed.

I was running the example analysis on the Boston data (house-price regression from scikit-learn). Below are the 3 feature importance plots:

Built-in importance

(plot: built-in xgboost importance)

Permutation based importance

(plot: permutation importance)

SHAP importance

(plot: SHAP summary importance)

All plots are for the same model, and as you can see, there is a difference in the results. I prefer permutation-based importance because I have a clear picture of which features impact the performance of the model (provided there is no high collinearity).

Try this:

fscore = clf.best_estimator_.booster().get_fscore()

In case you are using XGBRegressor, try with: model.get_booster().get_score().

That returns results that you can directly visualize through the plot_importance command.
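For example, a minimal sketch (assuming model is the fitted XGBRegressor and matplotlib is installed); plot_importance also accepts a plain dict of scores:

import matplotlib.pyplot as plt
import xgboost as xgb

scores = model.get_booster().get_score(importance_type='weight')  # dict: feature name -> score
xgb.plot_importance(scores)  # plot_importance accepts a Booster, an XGBModel, or a dict of scores
plt.show()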

None of the above worked for me. This is the code I ended up with to sort features by importance:

from collections import Counter

# Sort the importance dict by value (descending) and list it as (feature, score) pairs.
Counter({k: v for k, v in sorted(model.get_fscore().items(), key=lambda item: item[1], reverse=True)}).most_common()

Just replace model with the name of your model and everything will be there. Of course I'm doing the same thing twice; there's no need to sort the dict before passing it to Counter, but I figured it wouldn't hurt to leave it there in case anyone hates Counters.

print(model.feature_importances_)

plt.bar(range(len(model.feature_importances_)), model.feature_importances_)

