
Feature importance plot using xgb and also ranger. Best way to compare

I'm working on a script that trains both a ranger random forest and an xgb regression. Depending on which performs best based on RMSE, one or the other is used to predict against hold-out data.

I would also like to return feature importance for both in a comparable way.

With the xgboost library, I can get my feature importance table and plot like so:

> xgb.importance(model = regression_model)
                 Feature        Gain       Cover  Frequency
1:              spend_7d 0.981006272 0.982513621 0.79219969
2:                   IOS 0.006824499 0.011105014 0.08112324
3:  is_publisher_organic 0.006379284 0.002917203 0.06770671
4: is_publisher_facebook 0.005789945 0.003464162 0.05897036

Then I can plot it like so:

> xgb.importance(model = regression_model) %>% xgb.plot.importance()

(feature importance plot)

That was using the xgboost library and its functions. With a ranger random forest, if I fit a regression model with importance = 'impurity', I can get feature importance as well. Then:

regression_model$variable.importance
             spend_7d        d7_utility_sum  recent_utility_ratio                   IOS  is_publisher_organic is_publisher_facebook 
         437951687132                     0                     0             775177421             600401959            1306174807 

I could just create a ggplot. But the scales are entirely different between what ranger returns in that table and what xgb shows in the plot.

Is there an out-of-the-box library or solution where I can plot the feature importance of either the xgb or the ranger model in a comparable way?

Both the "Gain" column of XGBoost and the importances of ranger with importance = "impurity" are constructed from the total decrease in impurity (and therefore the gain) over the splits on a given variable.

The only difference appears to be that XGBoost automatically expresses the importances as percentages, while ranger keeps them as raw values (sums of squared-error decreases), which are not very handy to plot. You can therefore transform the ranger importances by dividing them by their total sum, which gives you the equivalent percentages as in XGBoost.
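As a minimal sketch of that rescaling (in Python, using the raw ranger values from the table above):

```python
import numpy as np

# Raw impurity importances as returned by ranger (values from the table above)
raw = {
    "spend_7d": 437951687132,
    "d7_utility_sum": 0,
    "recent_utility_ratio": 0,
    "IOS": 775177421,
    "is_publisher_organic": 600401959,
    "is_publisher_facebook": 1306174807,
}

values = np.array(list(raw.values()), dtype=float)
shares = values / values.sum()  # now on the same 0-1 scale as xgboost's Gain

for name, share in zip(raw, shares):
    print(f"{name:>22}  {share:.4f}")
```

The dominant feature (spend_7d) ends up with a share close to 1, on the same scale as the Gain column of the xgboost table.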

Since using the impurity decrease can sometimes be misleading, I suggest instead that you compute, for both models, the variable importances via permutation. This gives you the importances in a simple way that is comparable across different models, and it is more stable.

I suggest this incredibly helpful post.

Here is the permutation importance, as defined there (sorry, it's Python, not R):

import numpy as np

def permutation_importances(rf, X_train, y_train, metric):
  # Score on the unshuffled data is the baseline
  baseline = metric(rf, X_train, y_train)
  imp = []
  for col in X_train.columns:
    save = X_train[col].copy()
    # Shuffle a single column and re-score the model
    X_train[col] = np.random.permutation(X_train[col])
    m = metric(rf, X_train, y_train)
    X_train[col] = save  # restore the original column
    imp.append(baseline - m)
  return np.array(imp)
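To sanity-check the idea, here is a self-contained toy run (the function is restated so the snippet runs on its own; the data and the stand-in "fitted model" are made up for illustration, and pandas is assumed to be installed):

```python
import numpy as np
import pandas as pd

def permutation_importances(model, X, y, metric):
    # Same logic as above: importance = score drop after shuffling each column
    baseline = metric(model, X, y)
    imp = []
    for col in X.columns:
        save = X[col].copy()
        X[col] = np.random.permutation(X[col].values)
        imp.append(baseline - metric(model, X, y))
        X[col] = save
    return np.array(imp)

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=500), "noise": rng.normal(size=500)})
y = 3.0 * X["signal"]  # the target depends on "signal" only

class TrueModel:
    """Stand-in fitted model that predicts y = 3 * signal exactly."""
    def predict(self, X):
        return 3.0 * X["signal"]

def neg_mse(model, X, y):
    # Higher is better, so baseline - shuffled_score is positive for useful features
    return -np.mean((model.predict(X) - y) ** 2)

imp = permutation_importances(TrueModel(), X, y, neg_mse)
# Shuffling "signal" hurts the score a lot; shuffling "noise" changes nothing
```

Because the metric is model-agnostic, the same call works whether `model` wraps an xgboost booster or a ranger forest, which is what makes the two directly comparable.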

However, ranger also allows permutation importances to be computed directly via importance = "permutation", and xgboost may offer something similar as well.
