How to handle categorical variables in sklearn GradientBoostingClassifier?

I am trying to train models with GradientBoostingClassifier using categorical variables.

The following is a minimal code sample that just tries to feed a categorical variable into GradientBoostingClassifier.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]

X_train = pandas.DataFrame(X_train)

# Insert fake categorical variable. 
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40

# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

The following error appears:

ValueError: could not convert string to float: 'b'

From what I gather, one-hot encoding of the categorical variables is required before GradientBoostingClassifier can build the model.

Can GradientBoostingClassifier build models using categorical variables without one-hot encoding?

The R gbm package is capable of handling the sample data above. I'm looking for a Python library with equivalent capability.

pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back into the training data.
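As a rough illustration of that procedure, here is a minimal sketch using pandas.get_dummies. The toy arrays and the names `numeric`, `category` and `dummies` are made up for this example and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric features plus one string-valued category.
numeric = np.array([[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2]])
category = pd.Series(["a", "a", "b", "b"], name="cat")

# One indicator column per category level; dtype=float keeps the result numeric.
dummies = pd.get_dummies(category, prefix="cat", dtype=float)

# Merge the dummy matrix back onto the numeric features as a plain NumPy array.
X = np.concatenate((numeric, dummies.to_numpy()), axis=1)

print(list(dummies.columns))
print(X.shape)
```

Note that get_dummies only creates columns for the levels it sees, so the training and test data need the same set of levels (or the test columns need to be reindexed to match the training columns).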

Below is the example code from the question with the above procedure carried out.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
import numpy as np
import pandas as pd

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]


###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.
###### pandas.get_dummies is used here because the statsmodels categorical
###### helper is deprecated in recent statsmodels releases.

# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
# One indicator column per level, returned as a float NumPy array.
catVar = pd.get_dummies(pd.Series(catVar), dtype=float).to_numpy()
X_train = np.concatenate((X_train, catVar), axis = 1)

catVar = np.array(['a']*10 + ['b']*10)
catVar = pd.get_dummies(pd.Series(catVar), dtype=float).to_numpy()
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################

# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

prob = clf.predict_proba(X_test)[:,1]   # Only look at P(y==1).

fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)

print(prob)
print(y_test)
print(roc_auc_prob)

Thanks to Andreas Muller for pointing out that a pandas DataFrame should not be used with scikit-learn estimators.

Sure it can handle it; you just have to encode the categorical variables as a separate step in the pipeline. Sklearn is as capable of handling categorical variables as R or any other ML package. The R package is still (presumably) doing one-hot encoding behind the scenes; it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).
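One way to read "a separate step in the pipeline" is sketched below with scikit-learn's Pipeline, ColumnTransformer and OneHotEncoder. This is only an illustration under some assumptions: the column names `f0` to `f3` and `cat` are invented here, and the fake categorical column is appended to the four iris features rather than overwriting the first one as in the question.

```python
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

iris = datasets.load_iris()
mask = iris.target < 2  # keep classes 0 and 1 only
X = pd.DataFrame(iris.data[mask], columns=["f0", "f1", "f2", "f3"])
y = iris.target[mask]

# Hypothetical categorical column, mirroring the fake variable in the question.
X["cat"] = ["a"] * 50 + ["b"] * 50

# Same 80/20 split as in the question.
train_idx = list(range(40)) + list(range(50, 90))
test_idx = list(range(40, 50)) + list(range(90, 100))

# Encoding step: one-hot encode "cat", pass the numeric columns through unchanged.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat"])],
    remainder="passthrough",
)

# Fitting step: the classifier only ever sees the encoded numeric matrix.
model = Pipeline([
    ("encode", encode),
    ("gbc", GradientBoostingClassifier(learning_rate=0.01, max_depth=8, n_estimators=50)),
])

model.fit(X.iloc[train_idx], y[train_idx])
pred = model.predict(X.iloc[test_idx])
print(pred)
```

Keeping the encoder inside the pipeline means the category levels learned during fit are reused at predict time, and handle_unknown="ignore" encodes any level not seen in training as all zeros instead of raising an error.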
