[英]is there any way to get samples under each leaf of a decision tree?
I have trained a decision tree using a dataset. 我使用数据集训练了决策树。 Now I want to see which samples fall under which leaf of the tree. 现在我想看看哪些样本落在树的哪个叶子下面。
From here I want the red circled samples. 从这里我想要红色圆圈样本。
I am using Python's Sklearn's implementation of decision tree . 我正在使用Python的Sklearn的决策树实现。
If you want only the leaf for each sample you can just use 如果您只想要每个样品的叶子,您可以使用
clf.apply(iris.data)
array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 14, 5, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 16, 16, 16, 16, 16, 16, 6, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 8, 16, 16, 16, 16, 16, 16, 15, 16, 16, 11, 16, 16, 16, 8, 8, 16, 16, 16, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]) 数组([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ,1,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,14,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,14,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 ,5,5,5,10,5,5,5,5,5,10,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,10,5,5,5,5,10,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,10,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 ,5,16,16,16,16,16,16,6,16,16,16,16,16,16,16,16,16,16,16,16,8,16,16,16,16 ,16,16,15,16,16,11,16,16,16,8,8,16,16,16,15,16,16,16,16,16,16,16,16,16,16 ,16])
If you want to get all samples for each node you could calculate all the decision paths with 如果要获取每个节点的所有样本,可以使用计算所有决策路径
dec_paths = clf.decision_path(iris.data)
Then loop over the decision paths, convert them to arrays with toarray()
and check whether they belong to a node or not. 然后遍历决策路径,使用toarray()
将它们转换为数组,并检查它们是否属于某个节点。 Everything is stored in a defaultdict
where the key is the node number and the values are the sample number. 所有内容都存储在defaultdict
,其中键是节点编号,值是样本编号。
for d, dec in enumerate(dec_paths):
for i in range(clf.tree_.node_count):
if dec.toarray()[0][i] == 1:
samples[i].append(d)
Complete code 完整的代码
import sklearn.datasets
import sklearn.tree
import collections
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
iris = sklearn.datasets.load_iris()
clf = clf.fit(iris.data, iris.target)
samples = collections.defaultdict(list)
dec_paths = clf.decision_path(iris.data)
for d, dec in enumerate(dec_paths):
for i in range(clf.tree_.node_count):
if dec.toarray()[0][i] == 1:
samples[i].append(d)
Output 产量
print(samples[13])
[70, 126, 138] [70,126,138]
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.