
Why is (rf)clf feature_importance giving importance to a feature where all values are the same?

I am comparing multi-class classification with Random Forests and CART in scikit-learn.

Two of my features (feature 4 and feature 6) are irrelevant for the classification because all of their values are the same. But the feature_importances_ output of the RandomForestClassifier is the following:

Feature ranking:

  1. feature 3 (0.437165)
  2. feature 2 (0.216415)
  3. feature 6 (0.102238)
  4. feature 5 (0.084897)
  5. feature 1 (0.064624)
  6. feature 4 (0.059332)
  7. feature 0 (0.035328)

CART feature_importances_ output:

Feature ranking:

  1. feature 3 (0.954666)
  2. feature 6 (0.014117)
  3. feature 0 (0.011529)
  4. feature 1 (0.010586)
  5. feature 2 (0.006785)
  6. feature 4 (0.002204)
  7. feature 5 (0.000112)

In every row, feature 4 has the same value. The same holds for feature 6.
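A quick sanity check (a minimal sketch, assuming the feature matrix is the NumPy array x used in the code below) to confirm that the two columns really are constant:

import numpy as np

# A truly constant column reports exactly one unique value
for j in (4, 6):
    print("feature %d unique values:" % j, np.unique(x[:, j]))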

Here is the code:

Random Forest

import numpy as np

importances = rfclf.feature_importances_
# Spread of the importances across the individual trees
# (useful for error bars; not printed below)
std = np.std([tree.feature_importances_ for tree in rfclf.estimators_],
             axis=0)
# Feature indices sorted by decreasing importance
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(x.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

CART

importances = clf.feature_importances_
# Feature indices sorted by decreasing importance
# (no per-tree std here: a single CART tree has no estimators_)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(x.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

I would expect the importances to look like this:

  1. feature 6 (0.000000)
  2. feature 4 (0.000000)

When I simply leave out those two features, my models overfit.

You need to set a limit on the depth of your trees. I recommend doing a grid search over min_samples_leaf=[0.001, 0.1] - trying between 0.1% and 10% of the samples in each leaf.
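A minimal sketch of such a grid search (assuming a feature matrix x and labels y as in the question; the estimator settings and grid values are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try leaf sizes from 0.1% to 10% of the samples, as suggested above
param_grid = {"min_samples_leaf": np.linspace(0.001, 0.1, 10)}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      param_grid, cv=5)
search.fit(x, y)
print("best min_samples_leaf:", search.best_params_["min_samples_leaf"])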

Any kind of feature importance calculation must be done on a robust model to be meaningful.
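As a cross-check not mentioned in the answer, permutation importance sidesteps the impurity-based measure entirely: shuffling a constant column leaves the input unchanged, so such a feature scores exactly zero. A sketch, assuming held-out arrays x_test and y_test (hypothetical names):

from sklearn.inspection import permutation_importance

# A constant column yields exactly 0: permuting identical values
# cannot change the model's predictions
result = permutation_importance(rfclf, x_test, y_test, n_repeats=10,
                                random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print("feature %d (%f)" % (j, result.importances_mean[j]))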
