
How to extract the feature importances from the LogitBoost algorithm in a multi-class classification setting?

I am currently running a multi-class LogitBoost algorithm (docs), which works great. However, when trying to view the importances of the different features, I get this error message:

NotImplementedError: Feature importances is currently only implemented for binary classification tasks.

When looking at the GitHub code, I don't really understand why this has not been implemented yet. Does anybody know a way around this so that I can plot the feature importances, or is there nothing I can do except wait for a newer version of LogitBoost (which seems unlikely, since the last update was several months ago)?

I have already tried to modify the logitboost.py file, but since I have limited programming knowledge, this is a rather tedious process.

Thanks in advance!

By looking a bit into the source code, the base_estimator defaults to a DecisionTreeRegressor stump:

# The default regressor for LogitBoost is a decision stump
_BASE_ESTIMATOR_DEFAULT = DecisionTreeRegressor(max_depth=1)

Which we know does have feature importances, though apparently this package does not yet implement the method for multi-class problems. By looking into the structure of the fitted classifier, though, it seems fairly simple to come up with a custom importance metric.
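As a quick sanity check (a minimal sketch, independent of logitboost), a single fitted DecisionTreeRegressor stump does expose feature_importances_:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

X, y = load_iris(return_X_y=True)
# Fit one decision stump -- the same base estimator LogitBoost defaults to
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
# A stump makes a single split, so exactly one feature gets non-zero importance
print(stump.feature_importances_)
```

Since each inner estimator carries this attribute, aggregating them ourselves is all that is needed.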

Let's see with an example, using the iris dataset:

import logitboost
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
lg = logitboost.LogitBoost()
lg.fit(X_train, y_train)

If you look at lg.estimators_, you'll see that the structure is a nested list of fitted decision trees. We could do something like the following to get the overall importance:

l_feat_imp = [sum(cls.feature_importances_ for cls in cls_list)
              for cls_list in lg.estimators_]
imp = np.array(l_feat_imp).sum(0)
# e.g. array([ 9., 19., 51., 71.])  (exact values vary with the random split)

i.e. this just takes the sum of the contributions of each feature within every inner list of estimators, and then sums again over those per-list totals. So in this case we'd have:

pd.Series(imp, index=load_iris().feature_names).sort_values(ascending=False).plot.bar()
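If relative importances are preferred, the raw sums can be normalized to fractions (a sketch; the imp values below are the example numbers from the run above and will differ per split):

```python
import numpy as np

# Hypothetical raw importance sums, as computed above (values vary per run)
imp = np.array([9., 19., 51., 71.])
# Normalize so the importances sum to 1, giving each feature's relative share
rel_imp = imp / imp.sum()
print(rel_imp)
```

This makes the plot comparable across models with different numbers of boosting iterations.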

[Image: bar plot of the summed feature importances for each iris feature]
