
Using OneHotEncoder for categorical features in decision tree classifier

I am new to ML in Python and confused about how to implement a decision tree with categorical variables, since party and ctree in R encode them automatically.

I want to build a decision tree with two categorical independent features and one dependent class.

The dataframe I am using looks like this:

data
      title_overlap_quartile sales_rank_quartile rank_grp
    0                     Q4                  Q2    GRP 1
    1                     Q4                  Q3    GRP 1
    2                     Q2                  Q1    GRP 1
    3                     Q4                  Q1    GRP 1
    5                     Q2                  Q1    GRP 2

I understand that categorical features need to be encoded in scikit-learn using LabelEncoder and/or OneHotEncoder.

First I tried using only the label encoder, but that does not solve the problem, because DecisionTreeClassifier then treats the encoded variables as continuous. Then I read in this post, Issue with OneHotEncoder for categorical features, that the variable should first be encoded with the label encoder and then encoded again with the one-hot encoder.

I tried to implement that on this dataset in the following way, but I am getting an error.

def encode_features(df, columns):
    le = preprocessing.LabelEncoder()
    ohe = preprocessing.OneHotEncoder(sparse=False)
    for i in columns:
        le.fit(df[i].unique())
        df[i+'_le'] = le.transform(df[i])
        df[i+'_le'] = df[i+'_le'].values.reshape(-1, 1)
        df[i+'_le'+'_ohe'] = ohe.fit_transform(df[i+'_le'])
    return(df)

data = encode_features(data, ['title_overlap_quartile', 'sales_rank_quartile'])


  File "/Users/vaga/anaconda2/envs/py36/lib/python3.5/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')

ValueError: Length of values does not match length of index

When I remove the ohe part from the function and run it outside, it runs, but the results look weird:

def encode_features(df, columns):
    le = preprocessing.LabelEncoder()
    ohe = preprocessing.OneHotEncoder(sparse=False)
    for i in columns:
        le.fit(df[i].unique())
        df[i+'_le'] = le.transform(df[i])
        # df[i+'_le'] = df[i+'_le'].values.reshape(-1, 1)
        # df[i+'_le'+'_ohe'] = ohe.fit_transform(df[i+'_le'])
    return(df)

data = encode_features(data, ['title_overlap_quartile', 'sales_rank_quartile']) 

data['title_overlap_quartile_le'] = data['title_overlap_quartile_le'].values.reshape(-1, 1)

print(ohe.fit_transform(data['title_overlap_quartile_le']))

[[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]]

I also tried pandas.get_dummies, which converts each variable into multiple binary-coded columns, but the decision tree classifier again treats those columns as continuous variables.

Can someone please help me fit a decision tree that treats the categorical variables as categorical, and output the tree diagram?

The code I am using to fit and draw the tree is:

clf = tree.DecisionTreeClassifier()
clf = clf.fit(data[['title_overlap_score', 'sales_rank_quartile']], data[['rank_grp']])

dot_data = tree.export_graphviz(clf, out_file=None, feature_names=data[['title_overlap_score', 'sales_rank_quartile']].columns,  
                         filled=True, rounded=True,  
                         special_characters=True)  

graph = graphviz.Source(dot_data)  
graph.render("new_tree")

Although decision trees are supposed to handle categorical variables, sklearn's implementation currently cannot, due to this unresolved bug. The current workaround, which is somewhat convoluted, is to one-hot encode the categorical variables before passing them to the classifier.
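A minimal sketch of that workaround, assuming pandas.get_dummies is an acceptable way to do the one-hot encoding. The toy frame below copies the five rows shown in the question; the names feature_cols, X and y are only illustrative. It also sidesteps the error in the question: a one-hot encoder returns a 2-D array with one column per category, which cannot be assigned to a single DataFrame column, whereas get_dummies returns a whole frame of indicator columns.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data copied from the five rows shown in the question.
data = pd.DataFrame({
    "title_overlap_quartile": ["Q4", "Q4", "Q2", "Q4", "Q2"],
    "sales_rank_quartile": ["Q2", "Q3", "Q1", "Q1", "Q1"],
    "rank_grp": ["GRP 1", "GRP 1", "GRP 1", "GRP 1", "GRP 2"],
})

feature_cols = ["title_overlap_quartile", "sales_rank_quartile"]

# One 0/1 indicator column per category level,
# e.g. title_overlap_quartile_Q4 or sales_rank_quartile_Q1.
X = pd.get_dummies(data[feature_cols], dtype=int)
y = data["rank_grp"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# DOT source for the fitted tree; it can be passed to
# graphviz.Source(...) and rendered as in the question.
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(X.columns),
    class_names=list(clf.classes_),
    filled=True,
    rounded=True,
)
```

The tree still splits on each indicator with a numeric threshold (0.5), but for a 0/1 column that is the same as asking whether the row belongs to that category.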

Have you tried category_encoders? It is easier to handle and can also be used within pipelines.

The latest, not-yet-released version of scikit-learn seems to allow string column types without conversion to int.
