Is there any way to get the samples under each leaf of a decision tree?

Question

I have trained a decision tree using a dataset. Now I want to see which samples fall under which leaf of the tree.

From here I want the red circled samples.

I am using the decision tree implementation from Python's scikit-learn (sklearn).

Answer 1

If you only want the leaf for each sample, you can just use

clf.apply(iris.data)

array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 14, 5, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 16, 16, 16, 16, 16, 16, 6, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 8, 16, 16, 16, 16, 16, 16, 15, 16, 16, 11, 16, 16, 16, 8, 8, 16, 16, 16, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16])
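To go from this array to the actual samples under each leaf, one option is to group the sample indices by the leaf id that apply() returns. This is a minimal sketch, assuming the same iris setup as in the complete code further down; samples_per_leaf is just a name chosen here for illustration.

```python
import collections

import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(iris.data, iris.target)

# apply() returns, for each sample, the id of the leaf it ends up in.
leaf_ids = clf.apply(iris.data)

# Map leaf node id -> list of sample indices that land in that leaf.
# (samples_per_leaf is an illustrative name, not part of scikit-learn.)
samples_per_leaf = collections.defaultdict(list)
for sample_idx, leaf_id in enumerate(leaf_ids):
    samples_per_leaf[int(leaf_id)].append(sample_idx)

# Print each leaf id with the number of samples it holds.
for leaf_id in sorted(samples_per_leaf):
    print(leaf_id, len(samples_per_leaf[leaf_id]))
```

Because every sample ends in exactly one leaf, the lists partition the dataset, and iris.data[samples_per_leaf[k]] gives the feature rows for leaf k.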

If you want to get all samples for each node, you can compute all the decision paths with

dec_paths = clf.decision_path(iris.data)
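It may help to look at what decision_path returns before looping over it. As far as I know it is a SciPy CSR sparse matrix with one row per sample and one column per node, where a non-zero entry means the sample passes through that node. The sketch below assumes the same iris setup as the rest of this answer; nodes_visited is an illustrative name.

```python
import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(iris.data, iris.target)

dec_paths = clf.decision_path(iris.data)

# One row per sample, one column per tree node.
print(dec_paths.shape)

# In CSR format, the column indices of row r live in
# indices[indptr[r]:indptr[r + 1]]; here these are the node ids
# that sample 0 passes through on its way to a leaf.
sample_idx = 0
start, end = dec_paths.indptr[sample_idx], dec_paths.indptr[sample_idx + 1]
nodes_visited = dec_paths.indices[start:end].tolist()
print(nodes_visited)
```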

Then loop over the decision paths, convert each one to an array with toarray(), and check for every node whether the sample passes through it. The results are stored in a defaultdict where the key is the node number and the value is the list of sample indices that reach that node.

for d, dec in enumerate(dec_paths):
    row = dec.toarray()[0]  # dense 0/1 indicator over all nodes for sample d
    for i in range(clf.tree_.node_count):
        if row[i] == 1:
            samples[i].append(d)

Complete code

import sklearn.datasets
import sklearn.tree
import collections

clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
iris = sklearn.datasets.load_iris()
clf = clf.fit(iris.data, iris.target)

samples = collections.defaultdict(list)
dec_paths = clf.decision_path(iris.data)

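# For every sample d, record d under each node i that its path visits.
for d, dec in enumerate(dec_paths):
    row = dec.toarray()[0]  # dense 0/1 indicator over all nodes for sample d
    for i in range(clf.tree_.node_count):
        if row[i] == 1:
            samples[i].append(d)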

Output

print(samples[13])

[70, 126, 138]
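The loop above collects samples for every node, internal ones included. If only the leaves are of interest, one possible variant is to read the non-zero rows of each column of the sparse matrix directly and then keep the nodes that have no children. This is a sketch under the assumption that leaves are marked with -1 in clf.tree_.children_left, which is how scikit-learn's tree structure example describes them; samples_per_node and samples_per_leaf are illustrative names.

```python
import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(iris.data, iris.target)

# Sparse indicator matrix of shape (n_samples, n_nodes).
dec_paths = clf.decision_path(iris.data)

# Column-oriented storage makes per-node (per-column) lookups cheap.
paths_csc = dec_paths.tocsc()

# Node id -> sorted list of sample indices whose path visits that node.
samples_per_node = {
    node_id: sorted(paths_csc[:, node_id].nonzero()[0].tolist())
    for node_id in range(clf.tree_.node_count)
}

# Assumption: a node is a leaf when its left-child entry is -1.
is_leaf = clf.tree_.children_left == -1
samples_per_leaf = {
    node_id: idxs for node_id, idxs in samples_per_node.items() if is_leaf[node_id]
}

for node_id, idxs in samples_per_leaf.items():
    print(node_id, len(idxs))
```

This avoids calling toarray() once per sample, which should matter mostly for larger datasets or deeper trees.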


Answer 1
11, accepted, 2017-07-30 11:00:47
