
Why is (rf)clf feature_importance giving importance to a feature where all values are the same?

I am comparing multi-class classification with Random Forests and CART in scikit-learn.

Two of my features (feature 4 and feature 6) are irrelevant for the classification because all of their values are the same. But the feature_importances_ output of the RandomForestClassifier is the following:

Feature ranking:

  1. feature 3 (0.437165)
  2. feature 2 (0.216415)
  3. feature 6 (0.102238)
  4. feature 5 (0.084897)
  5. feature 1 (0.064624)
  6. feature 4 (0.059332)
  7. feature 0 (0.035328)

CART feature_importances_ output:

Feature ranking:

  1. feature 3 (0.954666)
  2. feature 6 (0.014117)
  3. feature 0 (0.011529)
  4. feature 1 (0.010586)
  5. feature 2 (0.006785)
  6. feature 4 (0.002204)
  7. feature 5 (0.000112)

In every row, feature 4 has the same value. The same holds for feature 6.
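A quick sanity check (a minimal sketch, assuming the feature matrix is the NumPy array x used in the code below) to confirm that the two columns really are constant:

import numpy as np

# A truly constant column reports exactly one unique value
for j in (4, 6):
    print("feature %d unique values:" % j, np.unique(x[:, j]))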

Here is the code:

Random Forest

import numpy as np

importances = rfclf.feature_importances_
# Spread of the importances across the individual trees
# (useful for error bars; not printed below)
std = np.std([tree.feature_importances_ for tree in rfclf.estimators_],
             axis=0)
# Feature indices sorted by decreasing importance
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(x.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

CART

importances = clf.feature_importances_
# Feature indices sorted by decreasing importance
# (no per-tree std here: a single CART tree has no estimators_)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(x.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

I would expect the importances to look like this:

  1. feature 6 (0.000000)
  2. feature 4 (0.000000)

When I simply leave out those two features, my models overfit.

You need to set a limit on the depth of your trees. I recommend doing a grid search over min_samples_leaf=[0.001, 0.1] - trying between 0.1% and 10% of the samples in each leaf.
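A minimal sketch of such a grid search (assuming a feature matrix x and labels y as in the question; the estimator settings and grid values are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try leaf sizes from 0.1% to 10% of the samples, as suggested above
param_grid = {"min_samples_leaf": np.linspace(0.001, 0.1, 10)}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      param_grid, cv=5)
search.fit(x, y)
print("best min_samples_leaf:", search.best_params_["min_samples_leaf"])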

Any kind of feature importance calculation must be done on a robust model to be meaningful.
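As a cross-check not mentioned in the answer, permutation importance sidesteps the impurity-based measure entirely: shuffling a constant column leaves the input unchanged, so such a feature scores exactly zero. A sketch, assuming held-out arrays x_test and y_test (hypothetical names):

from sklearn.inspection import permutation_importance

# A constant column yields exactly 0: permuting identical values
# cannot change the model's predictions
result = permutation_importance(rfclf, x_test, y_test, n_repeats=10,
                                random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print("feature %d (%f)" % (j, result.importances_mean[j]))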
