MinMaxScaler + decision tree classifier with numerical and categorical data

Question

I would like to know how I should handle the following situation:

I have a labeled dataset on which I need to perform a classification task. Some features are numerical and others are categorical (non-ordinal), and my problem is that I don't know how to handle the categorical ones.

Before classifying, I usually apply a MinMaxScaler, but I can't do that on this particular dataset because of the categorical features.

I've read about one-hot encoding, but I don't understand how to apply it to my case. My dataset has some numerical features and 10 categorical features, one-hot encoding generates more columns in the dataframe, and I don't know how I need to prepare the resulting dataframe before sending it to the decision tree classifier.

To clarify the situation, the code I'm using so far is the following:

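from sklearn.preprocessing import MinMaxScaler

# "class" is a Python keyword, so the column needs bracket access
y = df['class']
X = df.drop(columns=['class'])

# scale every feature to [0, 1]; this is the step that fails on categorical columns
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# call DecisionTree classifier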

When the df has categorical features I get the following error: TypeError: data type not understood. If I apply one-hot encoding instead, I get a dataframe with many columns, and I don't know whether the decision tree classifier will understand the real structure of my data. That is, how can I tell the classifier that a group of columns belongs to one specific feature? Am I misunderstanding the whole situation? Sorry if this is a confusing question, but I am a newbie and I feel pretty confused about how to handle this.

Answer 1

I don't have enough reputation to comment, but note that decision tree classifiers don't require their input to be scaled. So if you're using a decision tree classifier, just use the features as they appear.
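One caveat: scikit-learn's DecisionTreeClassifier still expects numeric input, so string categories need to be encoded even though scaling can be skipped. The tree also doesn't need to be told which dummy columns belong together, since it simply treats each one as an independent 0/1 feature to split on. A minimal sketch of that, assuming a made-up toy dataframe (the column names age, income and city are hypothetical and only stand in for the asker's data):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the asker's df
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63],
    "income": [1800.0, 3200.0, 4100.0, 3900.0, 2500.0, 5200.0],
    "city": ["A", "B", "A", "C", "B", "C"],
    "class": [0, 1, 1, 1, 0, 1],
})

y = df["class"]
X = df.drop(columns=["class"])

# One-hot encode only the categorical column; numeric columns pass through unscaled
X_encoded = pd.get_dummies(X, columns=["city"], dtype=float)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_encoded, y)

print(list(X_encoded.columns))
print(clf.predict(X_encoded.head(2)))
```

With 10 categorical features, the same call would take all ten names in the columns list.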

If you're using a method that requires feature scaling, then you should probably do one-hot encoding and feature scaling separately. See this answer: https://stackoverflow.com/a/43798994/9988333
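One way to keep the two steps separate in scikit-learn is a ColumnTransformer, which applies MinMaxScaler to the numeric columns and OneHotEncoder to the categorical ones. The sketch below is only an illustration and not necessarily what the linked answer does. The toy data, the column names and the choice of KNeighborsClassifier as a scale-sensitive model are all assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data standing in for the asker's df
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63],
    "income": [1800.0, 3200.0, 4100.0, 3900.0, 2500.0, 5200.0],
    "city": ["A", "B", "A", "C", "B", "C"],
    "class": [0, 1, 1, 1, 0, 1],
})

y = df["class"]
X = df.drop(columns=["class"])

numeric_cols = ["age", "income"]   # assumed numeric features
categorical_cols = ["city"]        # assumed categorical features

# Scale numeric columns and one-hot encode categorical ones, independently
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# KNN is used here only as an example of a model that is sensitive to scale
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])
model.fit(X, y)

transformed = model.named_steps["preprocess"].transform(X)
predictions = model.predict(X)
print(transformed.shape)
print(predictions)
```

Wrapping the preprocessing and the model in one Pipeline means the scaler and encoder are fitted only on whatever data is passed to fit, which helps avoid leaking test data into the preprocessing.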

Alternatively, you could use a method that handles categorical variables 'out of the box', such as LGBM.

Answer 1 | 0 | accepted | 2019-10-31 13:52:43
