
Is there any way to visualize a decision tree (sklearn) with categorical features consolidated from one-hot encoded features?

Here is a link to a .csv file. This is a classic dataset that can be used to practice decision trees on!

import pandas as pd
import numpy as np
import scipy as sc
import scipy.stats
from math import log
import operator

df = pd.read_csv('tennis.csv')

target = df['play']  # the "play" column is the target (a Series, so no .columns attribute is needed)
features_dataframe = df.loc[:, df.columns != 'play']

Here is where my headache begins.

features_dataframe = pd.get_dummies(features_dataframe) 
features_dataframe.columns

I'm performing one-hot encoding on my feature (data) columns stored in features_dataframe, which are all categorical. Printing the resulting columns returns

Index(['windy', 'outlook_overcast', 'outlook_rainy', 'outlook_sunny',
   'temp_cool', 'temp_hot', 'temp_mild', 'humidity_high',
   'humidity_normal'],
  dtype='object')

I get why one-hot encoding needs to be performed! sklearn won't work on columns that are categorical.
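As a tiny self-contained illustration of what get_dummies is doing here (the toy values below are made up, not read from the tennis .csv), each categorical column is expanded into one 0/1 indicator column per category:

```python
import pandas as pd

# A toy frame with one categorical column (illustrative stand-in for the tennis data)
toy = pd.DataFrame({"outlook": ["sunny", "rainy", "overcast", "sunny"]})

encoded = pd.get_dummies(toy)
print(sorted(encoded.columns))
# ['outlook_overcast', 'outlook_rainy', 'outlook_sunny']
```

Each row has exactly one of the three indicator columns set, which is why the tree's split labels later show names like `outlook_overcast`.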

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(target.values)

k = le.transform(target.values)

The above code converts my target column stored in target, which essentially has binary class labels ("yes" and "no"), to integers, because sklearn won't work with categories (YAY!).

And now, finally, fitting the DecisionTreeClassifier; criterion = "entropy" is what I'm assuming uses the ID3 concept!

from sklearn import tree
from os import system

dtree = tree.DecisionTreeClassifier(criterion = "entropy")
dtree = dtree.fit(features_dataframe, k)


with open("id3.dot", 'w') as dotfile:
    tree.export_graphviz(dtree, out_file = dotfile, feature_names = features_dataframe.columns)

The file id3.dot has the necessary code which can be pasted on this site to convert the digraph code into a proper, understandable visualization!
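As an aside, newer sklearn versions (0.21+) can also draw the tree directly with tree.plot_tree, with no Graphviz or website round-trip at all. A minimal sketch, assuming matplotlib is installed (the tiny X/y below are stand-in data, not the tennis features):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary features and labels (the real code would fit on features_dataframe, k)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
dtree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

fig, ax = plt.subplots(figsize=(8, 6))
tree.plot_tree(dtree, feature_names=["windy", "humidity_high"], ax=ax)
fig.savefig("id3.png")  # rendered tree image
```

The same one-hot naming problem applies here, though: plot_tree also prints the encoded column names at each split.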

For you to effectively and easily help me, I will post the code of id3.dot over here!

digraph Tree {
node [shape=box] ;
0 [label="outlook_overcast <= 0.5\nentropy = 0.94\nsamples = 14\nvalue = [5, 9]"] ;
1 [label="humidity_high <= 0.5\nentropy = 1.0\nsamples = 10\nvalue = [5, 5]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="windy <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [1, 4]"] ;
1 -> 2 ;
3 [label="entropy = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
2 -> 3 ;
4 [label="outlook_rainy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
2 -> 4 ;
5 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
4 -> 6 ;
7 [label="outlook_sunny <= 0.5\nentropy = 0.722\nsamples = 5\nvalue = [4, 1]"] ;
1 -> 7 ;
8 [label="windy <= 0.5\nentropy = 1.0\nsamples = 2\nvalue = [1, 1]"] ;
7 -> 8 ;
9 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
8 -> 9 ;
10 [label="entropy = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
8 -> 10 ;
11 [label="entropy = 0.0\nsamples = 3\nvalue = [3, 0]"] ;
7 -> 11 ;
12 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 4]"] ;
0 -> 12 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}

Go here, and paste the above digraph code to get a proper visualization of the decision tree created! The problem here is that for larger trees and larger datasets, the result is very hard to interpret, because the one-hot encoded features are displayed as the feature names representing the node splits!

Is there a workaround where the decision tree visualization will show consolidated feature names to represent node splits on the one-hot encoded features?

What I mean by this: is there a way to create a decision tree visualization like this
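One pragmatic workaround (a sketch of my own, not a built-in sklearn feature) is to post-process the .dot labels before rendering: a split like `outlook_sunny <= 0.5` on a 0/1 dummy column is just a yes/no test on the original category, so a regex can rewrite it into a consolidated form. The `consolidate` helper below is hypothetical:

```python
import re

def consolidate(dot_code):
    # Rewrite one-hot splits like "outlook_sunny <= 0.5" as "outlook = sunny?"
    # (the "True" branch of such a split means the dummy is 0, i.e. NOT that category)
    return re.sub(r"(\w+?)_(\w+) <= 0\.5", r"\1 = \2?", dot_code)

label = '0 [label="outlook_overcast <= 0.5\\nentropy = 0.94"]'
print(consolidate(label))
# 0 [label="outlook = overcast?\nentropy = 0.94"]
```

Columns without an underscore (like `windy`) are left untouched, which is fine since they were binary to begin with. This relies on the original feature names not containing underscores themselves, so treat it as a sketch rather than a robust solution.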

It's probably simpler to just not use one-hot encoding, and instead use arbitrary integer codes for the categories of a specific feature.

You can use pandas.factorize to integer-code categorical variables.
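For example (a minimal sketch on made-up values), pandas.factorize maps each distinct category to an integer code in order of first appearance, and also returns the unique values so the codes can be mapped back:

```python
import pandas as pd

outlook = pd.Series(["sunny", "overcast", "rainy", "sunny"])
codes, uniques = pd.factorize(outlook)
print(list(codes))    # [0, 1, 2, 0]
print(list(uniques))  # ['sunny', 'overcast', 'rainy']
```

The tree then splits on the single `outlook` column, so the visualization keeps one node label per original feature instead of one per dummy column.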
