
MinMaxScaler + DecisionTree classifier with numerical and categorical data

I would like to know how I should handle the following situation:

I have a dataset that I need to analyze. It is labeled data and I need to perform a classification task on it. Some features are numerical and others are categorical (non-ordinal), and my problem is that I don't know how to handle the categorical ones.

Before classifying, I usually apply a MinMaxScaler. But I can't do this with this particular dataset because of the categorical features.

I've read about one-hot encoding, but I don't understand how to apply it to my case. My dataset has some numerical features and 10 categorical features, and one-hot encoding generates more columns in the dataframe. I don't know how I need to prepare the resulting dataframe before sending it to the decision tree classifier.

To clarify the situation, the code I'm using so far is the following:

# `class` is a Python keyword, so the column must be accessed with brackets
y = df['class']
X = df.drop(['class'], axis=1)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# call DecisionTree classifier

When df has categorical features I get the following error: TypeError: data type not understood. If I apply one-hot encoding instead, I get a dataframe with many more columns, and I don't know whether the DecisionTree classifier will understand the real structure of my data. That is, how can I tell the classifier that a group of columns belongs to one specific feature? Am I misunderstanding the whole situation? Sorry if this is a confusing question, but I am a newbie and I feel pretty confused about how to handle this.

I don't have enough reputation to comment, but note that decision tree classifiers don't require their input to be scaled. So if you're using a decision tree classifier, you can use the numerical features as they appear, without a MinMaxScaler. Note that scikit-learn's trees still expect numeric input, so string-valued categorical columns need to be encoded first.
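To illustrate the point about scaling, here is a minimal sketch that fits the same tree on raw and on min-max scaled features and compares the results on the training data. The synthetic data, the `max_depth=3` setting and all variable names are assumptions made for this example; nothing here comes from the original dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 100, size=200),         # values between 0 and 99
    rng.integers(0, 100, size=200) * 1000,  # values between 0 and 99000
]).astype(float)
y = (X[:, 0] + X[:, 1] / 1000 > 100).astype(int)

# Fit one tree on the raw features and one on the min-max scaled features.
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
X_scaled = MinMaxScaler().fit_transform(X)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare which feature each node splits on, and the training-set predictions.
same_features = np.array_equal(raw_tree.tree_.feature, scaled_tree.tree_.feature)
same_predictions = np.array_equal(raw_tree.predict(X), scaled_tree.predict(X_scaled))
print(same_features, same_predictions)
```

Min-max scaling only shifts and stretches each feature, so the ordering of the values is preserved. A tree split depends on that ordering alone, so the thresholds move with the scale while the resulting partition of the training samples stays the same.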

If you're using a method that requires feature scaling, then you should probably do the one-hot encoding and the feature scaling separately; see this answer: https://stackoverflow.com/a/43798994/9988333

Alternatively, you could use a method that handles categorical variables 'out of the box', such as LightGBM (LGBM).
