
How can I get information about the trees in a Random Forest in sklearn?

I would like to learn more about the Random Forest Regressors I am building with sklearn. For example, what depth do the trees have on average if I do not regularise?

The reason for this is that I need to regularise the model and want to get a feeling for what the model looks like at the moment. Also, if I set e.g. max_leaf_nodes , will it still be necessary to also restrict max_depth , or will this "problem" sort of solve itself because the tree cannot be grown too deep if max_leaf_nodes is set? Does this make sense, or am I thinking in the wrong direction? I could not find anything on this.

If you want to know the average maximum depth of the trees constituting your Random Forest model, you have to access each tree individually and query its maximum depth, and then compute a statistic out of the results you obtain.

Let's first make a reproducible example of a Random Forest classifier model (taken from the Scikit-learn documentation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100,
                             random_state=0)
clf.fit(X, y)

Now we can iterate over its estimators_ attribute, which contains each decision tree. For each decision tree, we query the attribute tree_.max_depth , store the response, and take an average after completing our iteration:

max_depth = list()
for tree in clf.estimators_:
    max_depth.append(tree.tree_.max_depth)

print("avg max depth %0.1f" % (sum(max_depth) / len(max_depth)))

This will give you an idea of the average maximum depth of the trees composing your Random Forest model (it works exactly the same for a regressor model, as you asked about).
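For completeness, here is a minimal sketch of the same idea applied to a regressor; the dataset and hyperparameters are arbitrary choices for illustration, not taken from the question:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic regression data, chosen only to make the example runnable
X, y = make_regression(n_samples=1000, n_features=4, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)

# Each estimator is a DecisionTreeRegressor exposing the same tree_ attribute
depths = [tree.tree_.max_depth for tree in reg.estimators_]
print("avg max depth %0.1f" % (sum(depths) / len(depths)))
```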

Anyway, as a suggestion: if you want to regularize your model, you had better test parameter hypotheses under a cross-validation and grid/random search paradigm. In such a context you don't actually need to ask yourself how hyperparameters interact with each other; you just test different combinations and pick the best one based on the cross-validation score.
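A minimal sketch of that paradigm with GridSearchCV, reusing the classifier setup from above; the parameter grid values are hypothetical and should be adapted to your data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Hypothetical grid: the search tries every combination, so interactions
# between max_depth and max_leaf_nodes are resolved automatically.
param_grid = {
    "max_depth": [None, 5, 10],
    "max_leaf_nodes": [None, 20, 50],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The best combination is then available via `search.best_params_` and the refit model via `search.best_estimator_`.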

In addition to @Luca Massaron's answer:

I found https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py , which can be applied to each tree in the forest using

for tree in clf.estimators_:

The number of leaf nodes can be calculated like this:

import numpy as np

n_trees = len(clf.estimators_)
n_leaves = np.zeros(n_trees, dtype=int)
for i in range(n_trees):
    n_nodes = clf.estimators_[i].tree_.node_count
    # leaf nodes have no children, so children_left (or children_right)
    # is -1 at a leaf
    children_left = clf.estimators_[i].tree_.children_left
    for x in range(n_nodes):
        if children_left[x] == -1:
            n_leaves[i] += 1
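As an aside, recent scikit-learn versions (0.21+) expose the leaf count directly, so the manual loop can be replaced by a one-liner; a sketch using the classifier from the first answer:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# get_n_leaves() returns the number of leaves of each fitted tree
n_leaves = [tree.get_n_leaves() for tree in clf.estimators_]
print("avg leaves %0.1f" % (sum(n_leaves) / len(n_leaves)))
```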
