Why is (rf)clf feature_importances_ giving importance to a feature where all values are the same?
I am comparing multi-class classification with Random Forests and CART in scikit-learn. Two of my features (feature 4 and feature 6) are irrelevant for the classification because all of their values are the same. But the feature_importances_ output of the RandomForestClassifier is the following:
Feature ranking:
- feature 3 (0.437165)
- feature 2 (0.216415)
- feature 6 (0.102238)
- feature 5 (0.084897)
- feature 1 (0.064624)
- feature 4 (0.059332)
- feature 0 (0.035328)
CART feature_importances_ output:
Feature ranking:
- feature 3 (0.954666)
- feature 6 (0.014117)
- feature 0 (0.011529)
- feature 1 (0.010586)
- feature 2 (0.006785)
- feature 4 (0.002204)
- feature 5 (0.000112)
In every row, feature 4 has the same value; the same holds for feature 6.
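As a sanity check, it may be worth verifying that those columns really are constant. Here is a minimal sketch using made-up data in place of the real x (the array and its constant columns 4 and 6 are assumptions, not the asker's dataset):

```python
import numpy as np

# Toy stand-in for the real feature matrix x; columns 4 and 6 are made constant.
rng = np.random.default_rng(0)
x = rng.random((100, 7))
x[:, 4] = 1.0
x[:, 6] = 0.5

# A column is constant if every row equals the value in the first row.
constant_cols = [j for j in range(x.shape[1]) if np.all(x[:, j] == x[0, j])]
print(constant_cols)  # [4, 6]
```

If a column that looks constant does not appear in this list, something like stray whitespace or NaN values may be making it non-constant.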
Here is the code:
importances = rfclf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfclf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(x.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(x.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
I would expect the importances to be something like:
- feature 6 (0.000000)
- feature 4 (0.000000)
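That expectation is reasonable in principle: a truly constant column can never produce a valid split, so scikit-learn assigns it an importance of exactly 0. A minimal sketch with synthetic data (the array, target, and column index are illustrative assumptions, not the asker's setup):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: column 1 is constant, the target depends only on column 0.
rng = np.random.default_rng(0)
x = rng.random((200, 3))
x[:, 1] = 7.0
y = (x[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(x, y)
# The constant column is never split on, so its importance is exactly 0.
print(clf.feature_importances_[1])  # 0.0
```

So if a supposedly constant feature shows nonzero importance, the column is most likely not actually constant in the data the model saw.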
When I simply leave those two features out, my models overfit.
You need to set a limit on the depth of your trees. I recommend doing a grid search over min_samples_leaf = [0.001, 0.1] - trying between 0.1% and 10% of the samples in each leaf.
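Such a search could look like the following sketch, using GridSearchCV with min_samples_leaf given as fractions of the training set (the toy data, the n_estimators value, and the intermediate grid values are my assumptions, not part of the question):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the real x / y.
rng = np.random.default_rng(0)
x = rng.random((200, 7))
y = rng.integers(0, 3, size=200)

# Floats in (0, 0.5] are interpreted as fractions of the training samples,
# so this searches from 0.1% up to 10% of samples per leaf.
param_grid = {"min_samples_leaf": [0.001, 0.005, 0.01, 0.05, 0.1]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, cv=3)
search.fit(x, y)
print(search.best_params_)
```

Constraining leaf size this way regularizes the trees, which in turn makes the resulting importance scores less driven by noise.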
Any kind of feature-importance calculation must be done on a robust model to be meaningful.