How to handle categorical variables in sklearn GradientBoostingClassifier?
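I am trying to train a model with GradientBoostingClassifier using categorical variables.
The following is a primitive code sample, just for trying to feed a categorical variable into GradientBoostingClassifier.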
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
X_train = pandas.DataFrame(X_train)
# Insert fake categorical variable.
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40
# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
The following error appears:
ValueError: could not convert string to float: 'b'
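From what I gather, it seems that categorical variables must be one-hot encoded before GradientBoostingClassifier can build the model.
Can GradientBoostingClassifier build models using categorical variables without one-hot encoding?
The R gbm package is able to handle the sample data above. I am looking for a Python library with equivalent capability.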
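pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back into the training data.
Below is the example code from the question with the above procedure carried out.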
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
from statsmodels.tools import categorical
import numpy as np
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.
# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
catVar = categorical(catVar, drop=True)
X_train = np.concatenate((X_train, catVar), axis = 1)
catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################
# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:,1] # Only look at P(y==1).
fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)
print(prob)
print(y_test)
print(roc_auc_prob)
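For comparison, the same encoding step can be sketched with pandas.get_dummies, the other function named above, which may be the safer choice if statsmodels' categorical helper is not available in the installed release. This is only a sketch: the helper add_dummies and the fixed category list ['a', 'b'] are assumptions made for illustration, not part of the original answer. Fixing the categories up front keeps the dummy columns identical for the training and test sets, and the result is converted to a NumPy array before it reaches the estimator.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier

iris = datasets.load_iris()
# Use only data for 2 classes.
mask = (iris.target == 0) | (iris.target == 1)
X, Y = iris.data[mask], iris.target[mask]
train_indices = list(range(40)) + list(range(50, 90))
test_indices = list(range(40, 50)) + list(range(90, 100))

def add_dummies(features, labels, categories):
    """Hypothetical helper: append one dummy column per category."""
    # A fixed category list guarantees the same columns for train and test.
    series = pd.Series(labels, dtype=pd.CategoricalDtype(categories))
    dummies = pd.get_dummies(series).to_numpy(dtype=float)
    return np.concatenate((features, dummies), axis=1)

categories = ['a', 'b']
# Fake categorical variable, as in the code above.
X_train = add_dummies(X[train_indices], ['a'] * 40 + ['b'] * 40, categories)
X_test = add_dummies(X[test_indices], ['a'] * 10 + ['b'] * 10, categories)
y_train, y_test = Y[train_indices], Y[test_indices]

clf = GradientBoostingClassifier(learning_rate=0.01, max_depth=8, n_estimators=50)
clf.fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:, 1]  # Only look at P(y==1).
print(prob)
```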
Thanks to Andreas Muller for pointing out that a pandas DataFrame should not be used with scikit-learn estimators.
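Sure it can handle it; you just have to encode the categorical variables as a separate step in the pipeline. Sklearn is perfectly capable of handling categorical variables, as well as R or any other ML package does. The R package is still (presumably) doing one-hot encoding behind the scenes; it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).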
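One way to make the encoding a separate pipeline step is sketched below, assuming a scikit-learn version that provides ColumnTransformer. The column names f0 to f3 and cat_var, and the step names encode and gbc, are made up for illustration. A DataFrame is used here only so that the categorical column can be selected by name.

```python
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

iris = datasets.load_iris()
# Use only data for 2 classes, split as in the question.
mask = (iris.target == 0) | (iris.target == 1)
X, Y = iris.data[mask], iris.target[mask]
train_indices = list(range(40)) + list(range(50, 90))
test_indices = list(range(40, 50)) + list(range(90, 100))

names = ['f0', 'f1', 'f2', 'f3']  # illustrative column names
X_train = pd.DataFrame(X[train_indices], columns=names)
X_test = pd.DataFrame(X[test_indices], columns=names)
y_train, y_test = Y[train_indices], Y[test_indices]

# Fake categorical variable, as in the question.
X_train['cat_var'] = ['a'] * 40 + ['b'] * 40
X_test['cat_var'] = ['a'] * 10 + ['b'] * 10

# Step 1: one-hot encode the categorical column, pass the rest through.
encode = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['cat_var'])],
    remainder='passthrough',
)
# Step 2: fit the classifier on the encoded matrix.
model = Pipeline([
    ('encode', encode),
    ('gbc', GradientBoostingClassifier(learning_rate=0.01, max_depth=8,
                                       n_estimators=50)),
])
model.fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]  # Only look at P(y==1).
print(prob)
```

Because the encoder is fitted inside the pipeline, the categories learned from the training data are reused when predicting, so the test set does not need to be encoded by hand.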