Here is a link to a .csv file. This is a classic dataset that can be used to practice decision trees on!
import pandas as pd

df = pd.read_csv('tennis.csv')
target = df['play']                                   # class labels as a Series
features_dataframe = df.loc[:, df.columns != 'play']  # everything except 'play'
Here is where my headache begins
features_dataframe = pd.get_dummies(features_dataframe)
features_dataframe.columns
I'm performing one-hot encoding on my feature columns stored in features_dataframe, all of which are categorical. Printing the resulting columns returns:
Index(['windy', 'outlook_overcast', 'outlook_rainy', 'outlook_sunny',
'temp_cool', 'temp_hot', 'temp_mild', 'humidity_high',
'humidity_normal'],
dtype='object')
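For reference, here's a minimal sketch of what pd.get_dummies is doing, on a hypothetical toy slice of the features (made-up values, not the real tennis.csv): each categorical column is expanded into one 0/1 column per category, while non-categorical columns like the boolean windy pass through untouched.

```python
import pandas as pd

# Hypothetical toy slice of the tennis features
features = pd.DataFrame({'outlook': ['sunny', 'overcast', 'rainy'],
                         'windy':   [False, True, False]})

encoded = pd.get_dummies(features)

# Non-object columns pass through first, then one dummy per category
print(encoded.columns.tolist())
# ['windy', 'outlook_overcast', 'outlook_rainy', 'outlook_sunny']
```

This is why a single outlook column becomes three outlook_* columns in the tree's splits.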
I get why one-hot encoding needs to be performed! sklearn estimators won't work on columns that are categorical (strings).
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(target.values)
k = le.transform(target.values)
The above code converts my target column stored in target, which holds binary class labels ("yes" and "no"), to integers, because sklearn won't work with categorical labels either (YAY!).
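As a quick sanity check on what LabelEncoder produces, here's a minimal sketch on toy labels (made-up values with the same "yes"/"no" shape as the play column): the classes are stored sorted alphabetically, so "no" maps to 0 and "yes" maps to 1.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['no', 'yes', 'yes', 'no'])  # toy labels, same shape as 'play'

# Classes are stored in sorted order, so 'no' -> 0 and 'yes' -> 1
print(list(le.classes_))                  # ['no', 'yes']
print(le.transform(['yes', 'no']))        # [1 0]
```

This mapping is what the value = [5, 9] counts in the exported tree refer to: index 0 is "no", index 1 is "yes".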
And now, finally, fitting the DecisionTreeClassifier with criterion="entropy", which I'm assuming follows the ID3 concept!
from sklearn import tree

dtree = tree.DecisionTreeClassifier(criterion="entropy")
dtree = dtree.fit(features_dataframe, k)

# Write the Graphviz source for the fitted tree to id3.dot
with open("id3.dot", "w") as dotfile:
    tree.export_graphviz(dtree, out_file=dotfile,
                         feature_names=features_dataframe.columns)
The file id3.dot contains the digraph code, which can be pasted on this site to get a proper, understandable visualization! To make it easier to help me, I'll post the contents of id3.dot here:
digraph Tree {
node [shape=box] ;
0 [label="outlook_overcast <= 0.5\nentropy = 0.94\nsamples = 14\nvalue = [5, 9]"] ;
1 [label="humidity_high <= 0.5\nentropy = 1.0\nsamples = 10\nvalue = [5, 5]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="windy <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [1, 4]"] ;
1 -> 2 ;
3 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
2 -> 3 ;
4 [label="outlook_rainy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
2 -> 4 ;
5 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
4 -> 6 ;
7 [label="outlook_sunny <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
1 -> 7 ;
8 [label="windy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
7 -> 8 ;
9 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
8 -> 9 ;
10 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
8 -> 10 ;
11 [label="entropy = 0.0\nsamples = 3\nvalue = [3, 0]"] ;
7 -> 11 ;
12 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 4]"] ;
0 -> 12 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
Go here and paste the above digraph code to get a proper visualization of the decision tree created! The problem is that, for larger trees and larger datasets, the result is hard to interpret, because the one-hot-encoded features are displayed as the feature names representing the node splits!
Is there a workaround where the decision tree visualization shows the consolidated (original) feature names for the node splits, instead of the one-hot-encoded ones? What I mean is: is there a way to create a decision tree visualization like this
It's probably simpler to not use one-hot encoding at all, and instead use arbitrary integer codes for the categories of each feature.
You can use pandas.factorize
to integer-code categorical variables.
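A minimal sketch of that approach, on a hypothetical mini version of the tennis data (the real tennis.csv isn't shown here): pd.factorize replaces each string column with integer codes in place, so the DataFrame keeps one column per original feature, and export_graphviz will then label splits with the original names like outlook and humidity.

```python
import pandas as pd

# Hypothetical mini version of the tennis dataset
df = pd.DataFrame({
    'outlook':  ['sunny', 'overcast', 'rainy', 'sunny'],
    'temp':     ['hot', 'mild', 'cool', 'hot'],
    'humidity': ['high', 'normal', 'high', 'normal'],
    'windy':    [False, True, False, True],
    'play':     ['no', 'yes', 'yes', 'no'],
})

features = df.loc[:, df.columns != 'play'].copy()

# Integer-code every string column in place; codes are assigned in
# order of first appearance, e.g. sunny -> 0, overcast -> 1, rainy -> 2
for col in features.select_dtypes(include='object').columns:
    features[col], _ = pd.factorize(features[col])

print(features.columns.tolist())  # same names as the raw data
```

You can then fit DecisionTreeClassifier on features and pass feature_names=features.columns to export_graphviz. One caveat: the integer codes impose an arbitrary ordering, so the tree's <= thresholds will group categories by those codes rather than treating them as purely unordered, which can change which splits the tree finds.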