How to handle categorical variables in sklearn GradientBoostingClassifier?

Question

I am trying to train models with GradientBoostingClassifier using categorical variables.

The following is a minimal code sample that just tries to feed a categorical variable into GradientBoostingClassifier.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]

X_train = pandas.DataFrame(X_train)

# Insert fake categorical variable. 
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40

# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

The following error appears:

ValueError: could not convert string to float: 'b'

From what I gather, one-hot encoding of the categorical variables is required before GradientBoostingClassifier can build the model.

Can GradientBoostingClassifier build models using categorical variables without one-hot encoding?

The R gbm package can handle the sample data above. I'm looking for a Python library with equivalent capability.

Answer 1

pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back into the training data.

Below is the example code from the question with the above procedure applied.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
from statsmodels.tools import categorical
import numpy as np

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]


###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.

# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
catVar = categorical(catVar, drop=True)
X_train = np.concatenate((X_train, catVar), axis = 1)

catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################

# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

prob = clf.predict_proba(X_test)[:,1]   # Only look at P(y==1).

fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)

print(prob)
print(y_test)
print(roc_auc_prob)

Thanks to Andreas Muller for pointing out that a pandas DataFrame should not be used with scikit-learn estimators.
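The same procedure can also be sketched with pandas.get_dummies, the other function named above. This may be useful because, as far as I know, statsmodels.tools.categorical has been deprecated and removed in newer statsmodels releases. The variable names, the reindex step and random_state below are my own assumptions, not part of the original answer; the dummy columns are converted to a plain NumPy array before fitting.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
# Use only data for 2 classes, as in the question.
mask = iris.target < 2
X = iris.data[mask]
y = iris.target[mask]

# Same 80% / 20% split as in the question.
train_idx = list(range(40)) + list(range(50, 90))
test_idx = list(range(40, 50)) + list(range(90, 100))

# Fake categorical variable (illustrative only).
cat_train = pd.Series(["a"] * 40 + ["b"] * 40)
cat_test = pd.Series(["a"] * 10 + ["b"] * 10)

# One dummy column per category level.
dummies_train = pd.get_dummies(cat_train)
# Align the test dummies to the training columns so both matrices
# share the same layout even if a level is missing from the test set.
dummies_test = pd.get_dummies(cat_test).reindex(
    columns=dummies_train.columns, fill_value=0
)

# Merge the dummy matrix back with the numeric features as NumPy arrays.
X_train = np.hstack([X[train_idx], dummies_train.to_numpy(dtype=float)])
X_test = np.hstack([X[test_idx], dummies_test.to_numpy(dtype=float)])

clf = GradientBoostingClassifier(
    learning_rate=0.01, max_depth=8, n_estimators=50, random_state=0
).fit(X_train, y[train_idx])

prob = clf.predict_proba(X_test)[:, 1]  # Only look at P(y==1).
print(prob)
```

Calling get_dummies separately on the training and test data only produces matching columns when both contain the same levels, which is why the sketch reindexes the test dummies against the training columns.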

Answer 2

Sure it can handle it; you just have to encode the categorical variables as a separate step in the pipeline. Sklearn is perfectly capable of handling categorical variables, just as R or any other ML package is. The R package is still (presumably) doing one-hot encoding behind the scenes; it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).
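One way to make encoding a separate pipeline step is sketched below, assuming a scikit-learn version that provides ColumnTransformer (0.20 or later). The column names f0 to f3, the step names and random_state are hypothetical choices for illustration; this is one possible arrangement, not the only one.

```python
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

iris = datasets.load_iris()
# Use only data for 2 classes, as in the question.
mask = iris.target < 2
X = pd.DataFrame(iris.data[mask], columns=["f0", "f1", "f2", "f3"])
y = iris.target[mask]

# Replace the first column with a fake categorical variable.
X["f0"] = ["a"] * 50 + ["b"] * 50

# Same 80% / 20% split as in the question.
train_idx = list(range(40)) + list(range(50, 90))
test_idx = list(range(40, 50)) + list(range(90, 100))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Step 1: one-hot encode the categorical column, pass the rest through.
encode = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["f0"])],
    remainder="passthrough",
)

# Step 2: fit the classifier on the encoded features.
model = Pipeline(
    steps=[
        ("encode", encode),
        (
            "gbc",
            GradientBoostingClassifier(
                learning_rate=0.01, max_depth=8, n_estimators=50, random_state=0
            ),
        ),
    ]
)

model.fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]  # Only look at P(y==1).
print(prob)
```

Because the encoder is fitted inside the pipeline, the same category-to-column mapping learned on the training data is reused at prediction time, and handle_unknown="ignore" keeps prediction from failing on levels not seen during fitting.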

Answer 1: score 11, accepted, 2014-07-21 20:51:32

Answer 2: score -4, 2014-07-11 21:46:28
