Is there any way to visualize a decision tree (sklearn) with categorical features consolidated from one-hot encoded features?
Here is a link to a .csv file. This is a classic dataset that can be used to practice decision trees on!
import pandas as pd
import numpy as np
import scipy as sc
import scipy.stats
from math import log
import operator
df = pd.read_csv('tennis.csv')
target = df['play']  # a pandas Series; it has a .name (already 'play'), not .columns
features_dataframe = df.loc[:, df.columns != 'play']
Here is where my headache begins:
features_dataframe = pd.get_dummies(features_dataframe)
features_dataframe.columns
I'm performing one-hot encoding on my feature (data) columns stored in features_dataframe, which are all categorical. Printing the columns returns:
Index(['windy', 'outlook_overcast', 'outlook_rainy', 'outlook_sunny',
'temp_cool', 'temp_hot', 'temp_mild', 'humidity_high',
'humidity_normal'],
dtype='object')
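To see what get_dummies is doing to a single categorical column, here is a minimal sketch (the sample outlook column is made up, not the full tennis data):

```python
import pandas as pd

# Made-up sample, not the real tennis.csv
sample = pd.DataFrame({"outlook": ["sunny", "rainy", "sunny"]})
dummies = pd.get_dummies(sample)

# One indicator column per category, named <column>_<category>
print(list(dummies.columns))  # ['outlook_rainy', 'outlook_sunny']
```

Each original category becomes its own 0/1 indicator column, which is exactly why the split labels in the tree later read like outlook_sunny instead of outlook.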
I get why one-hot encoding needs to be performed! sklearn won't work on columns that are categorical.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(target.values)
k = le.transform(target.values)
The above code converts my target column stored in target, which essentially has binary class labels ("yes" and "no"), to integers, because sklearn won't work with categories (YAY!).
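As a sanity check of what LabelEncoder does here: it sorts the unique labels and assigns each an integer index, so "no" becomes 0 and "yes" becomes 1. A minimal pure-Python equivalent (the sample labels are made up):

```python
labels = ["no", "yes", "yes", "no", "yes"]

# LabelEncoder sorts the unique classes (exposed as its classes_ attribute)...
classes = sorted(set(labels))  # ['no', 'yes']

# ...and transform() maps each label to its index in that sorted list
mapping = {c: i for i, c in enumerate(classes)}
encoded = [mapping[label] for label in labels]
print(encoded)  # [0, 1, 1, 0, 1]
```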
And now, finally, fitting the DecisionTreeClassifier with criterion = "entropy", which is what I'm assuming uses the ID3 concept!
from sklearn import tree
from os import system
dtree = tree.DecisionTreeClassifier(criterion = "entropy")
dtree = dtree.fit(features_dataframe, k)
dotfile = open("id3.dot", 'w')
tree.export_graphviz(dtree, out_file = dotfile, feature_names = features_dataframe.columns)
dotfile.close()
The file id3.dot contains the necessary code, which can be pasted on this site to convert the digraph code into a proper, understandable visualization!
So that you can help me effectively and easily, I will post the code of id3.dot here:
digraph Tree {
node [shape=box] ;
0 [label="outlook_overcast <= 0.5\nentropy = 0.94\nsamples = 14\nvalue = [5, 9]"] ;
1 [label="humidity_high <= 0.5\nentropy = 1.0\nsamples = 10\nvalue = [5, 5]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="windy <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [1, 4]"] ;
1 -> 2 ;
3 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
2 -> 3 ;
4 [label="outlook_rainy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
2 -> 4 ;
5 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
4 -> 6 ;
7 [label="outlook_sunny <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
1 -> 7 ;
8 [label="windy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
7 -> 8 ;
9 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
8 -> 9 ;
10 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
8 -> 10 ;
11 [label="entropy = 0.0\nsamples = 3\nvalue = [3, 0]"] ;
7 -> 11 ;
12 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 4]"] ;
0 -> 12 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
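As a sanity check on criterion = "entropy": the entropy printed in each node label follows directly from the class counts in that node's value. For example, the root node above has value = [5, 9]:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)

print(round(entropy([5, 9]), 3))  # 0.94  (root node, value = [5, 9])
print(round(entropy([5, 5]), 3))  # 1.0   (node 1, value = [5, 5])
print(entropy([0, 3]))            # 0.0   (node 3, a pure leaf)
```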
Go here, and paste the above digraph code to get a proper visualization of the decision tree created! The problem is that for larger trees and larger datasets, the visualization becomes very hard to interpret, because the one-hot encoded features are displayed as the feature names representing the node splits!
Is there a workaround where the decision tree visualization would show consolidated feature names to represent node splits, instead of the one-hot encoded features? What I mean by this: is there a way to create a decision tree visualization like this
It's probably simpler to just not use one-hot encoding, and instead use arbitrary integer codes for the categories of a specific feature. You can use pandas.factorize to integer-code categorical variables.
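A minimal sketch of that approach (the mini feature table below is made up, not the real tennis.csv): factorize each categorical column into a single integer column and keep the code-to-category mapping, so the tree fitted on these columns will show the original feature names in export_graphviz.

```python
import pandas as pd

# Made-up mini version of the tennis features
features = pd.DataFrame({
    "outlook":  ["sunny", "overcast", "rainy", "sunny"],
    "humidity": ["high", "normal", "high", "normal"],
})

mappings = {}
for col in features.columns:
    # factorize assigns integers in order of first appearance
    codes, uniques = pd.factorize(features[col])
    features[col] = codes
    mappings[col] = list(uniques)  # code -> original category

print(features["outlook"].tolist())  # [0, 1, 2, 0]
print(mappings["outlook"])           # ['sunny', 'overcast', 'rainy']
```

A split in the resulting tree then reads e.g. outlook <= 0.5, which you can decode through mappings; the trade-off is that the tree now treats the categories of each feature as ordered.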