
Is there any way to visualize decision tree (sklearn) with categorical features consolidated from one hot encoded features?

Here is a link to a .csv file. This is a classic dataset that can be used to practice decision trees on!

import pandas as pd
import numpy as np
import scipy as sc
import scipy.stats
from math import log
import operator

df = pd.read_csv('tennis.csv')

target = df['play']  # class labels ("yes"/"no") as a Series
features_dataframe = df.loc[:, df.columns != 'play']  # every column except the target

Here is where my headache begins

features_dataframe = pd.get_dummies(features_dataframe) 
features_dataframe.columns

I'm performing one-hot encoding on my feature columns (stored in features_dataframe), all of which are categorical. Printing the resulting columns returns:

Index(['windy', 'outlook_overcast', 'outlook_rainy', 'outlook_sunny',
   'temp_cool', 'temp_hot', 'temp_mild', 'humidity_high',
   'humidity_normal'],
  dtype='object')

I get why one-hot encoding needs to be performed! sklearn's decision trees won't accept string-valued categorical columns directly.
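As a minimal illustration of what pd.get_dummies does to a single categorical column (toy values, not the real dataset):

import pandas as pd

toy = pd.DataFrame({'outlook': ['sunny', 'overcast', 'rainy']})
print(pd.get_dummies(toy))
# One indicator column per category
# (newer pandas versions print True/False instead of 0/1):
#    outlook_overcast  outlook_rainy  outlook_sunny
# 0                 0              0              1
# 1                 1              0              0
# 2                 0              1              0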

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(target.values)

k = le.transform(target.values)

The above code converts my target column, stored in target, which holds binary class labels ("yes" and "no"), to integers, because sklearn won't work with categories (YAY!)
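If you want to double-check the mapping, the fitted encoder exposes it (classes_ is sorted alphabetically, so "no" becomes 0 and "yes" becomes 1):

print(le.classes_)                   # array(['no', 'yes'], dtype=object)
print(le.inverse_transform([0, 1]))  # ['no' 'yes']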

And now, finally, fitting the DecisionTreeClassifier. criterion="entropy" is what I'm assuming implements the ID3 concept!

from sklearn import tree

dtree = tree.DecisionTreeClassifier(criterion="entropy")
dtree = dtree.fit(features_dataframe, k)

# Write the fitted tree out in Graphviz dot format
with open("id3.dot", "w") as dotfile:
    tree.export_graphviz(dtree, out_file=dotfile,
                         feature_names=features_dataframe.columns)

The file id3.dot now holds the digraph code, which can be pasted on this site to get a proper, understandable visualization!
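Alternatively, the same dot file can be rendered locally with the graphviz Python package (this assumes both the package and the system Graphviz binaries are installed):

import graphviz

# Render id3.dot to id3.pdf; use format="png" for an image instead
with open("id3.dot") as f:
    graphviz.Source(f.read()).render("id3", format="pdf", cleanup=True)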

So that you can help me easily and effectively, here are the contents of id3.dot:

digraph Tree {
node [shape=box] ;
0 [label="outlook_overcast <= 0.5\nentropy = 0.94\nsamples = 14\nvalue = [5, 9]"] ;
1 [label="humidity_high <= 0.5\nentropy = 1.0\nsamples = 10\nvalue = [5, 5]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="windy <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [1, 4]"] ;
1 -> 2 ;
3 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
2 -> 3 ;
4 [label="outlook_rainy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
2 -> 4 ;
5 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
4 -> 6 ;
7 [label="outlook_sunny <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
1 -> 7 ;
8 [label="windy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
7 -> 8 ;
9 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
8 -> 9 ;
10 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
8 -> 10 ;
11 [label="entropy = 0.0\nsamples = 3\nvalue = [3, 0]"] ;
7 -> 11 ;
12 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 4]"] ;
0 -> 12 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
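As a sanity check, the root node's entropy = 0.94 (value = [5, 9] out of 14 samples) can be reproduced by hand with scipy:

import numpy as np
from scipy.stats import entropy

counts = np.array([5, 9])  # 5 "no", 9 "yes" at the root
print(entropy(counts / counts.sum(), base=2))  # ~0.940, matching the dot output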

Go here and paste the above digraph code to get a proper visualization of the decision tree! The problem is that for larger trees and larger datasets, this becomes very hard to interpret, because the node splits are labelled with the one-hot-encoded feature names!

Is there a workaround where the decision tree visualization shows consolidated feature names at the node splits, instead of the one-hot-encoded ones?

What I mean is: is there a way to create a decision tree visualization like this?

It's probably simpler not to use one-hot encoding at all, and instead use arbitrary integer codes for the categories of each feature.

You can use pandas.factorize to integer-code categorical variables.
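A minimal sketch of that approach on this dataset (assuming the same tennis.csv; the column names are those from the question):

import pandas as pd
from sklearn import tree

df = pd.read_csv('tennis.csv')

# Integer-code every categorical feature column in place,
# so the original column names ("outlook", "temp", ...) are kept
features = df.loc[:, df.columns != 'play'].copy()
for col in features.columns:
    features[col], _ = pd.factorize(features[col])

target, classes = pd.factorize(df['play'])

dtree = tree.DecisionTreeClassifier(criterion="entropy")
dtree.fit(features, target)

# The exported splits are now labelled with the consolidated names
with open("id3_factorized.dot", "w") as f:
    tree.export_graphviz(dtree, out_file=f, feature_names=features.columns)

One caveat: integer codes impose an arbitrary order on the categories, so a split like outlook <= 0.5 now compares codes rather than indicator values; for trees this is usually workable, but it does change the splits the model can express.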
