
How to optimize a sklearn pipeline, using XGBoost, for a different `eval_metric`?

I'm trying to use XGBoost and optimize the eval_metric as auc (as described here).

This works fine when using the classifier directly, but fails when I try to use it inside a pipeline.

What is the correct way to pass a .fit argument to the sklearn pipeline?

Example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn

print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)

X, y = load_iris(return_X_y=True)

# Without using the pipeline: 
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc')  # works fine

# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])

# using the pipeline, but not optimizing for 'auc': 
pipe.fit(X, y)  # works fine

# however this does not work (even after correcting the underscores): 
pipe.fit(X, y, classifier__eval_metric='auc')  # fails

The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'

Regarding the version of xgboost: `xgboost.__version__` shows 0.6, while `pip3 freeze | grep xgboost` shows `xgboost==0.6a2`.

The error occurs because you are using a single underscore between the estimator name and its parameter when using it in a pipeline. It should be two underscores.

From the documentation of Pipeline.fit(), we see the correct way of supplying params in fit:

Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

So in your case, the correct usage is:

pipe.fit(X_train, y_train, classifier__eval_metric='auc')

(Notice the two underscores between name and param.)
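This double-underscore routing is a general Pipeline feature, not specific to XGBoost. A minimal sketch using only scikit-learn, with SGDClassifier and an illustrative per-sample weight array standing in for the XGBoost classifier and its fit argument:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
weights = np.ones(len(y))  # illustrative per-sample weights

pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', SGDClassifier(random_state=0))])

# '<step name>__<param>' routes the argument to that step's fit method,
# so this ends up as SGDClassifier.fit(..., sample_weight=weights)
pipe.fit(X, y, classifier__sample_weight=weights)
print(pipe.score(X, y))
```

Any fit parameter of any step can be passed this way, as long as the prefix matches the step name given in the Pipeline constructor.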

When the goal is to optimize, I suggest using the sklearn wrapper and GridSearchCV:

from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search in older scikit-learn versions

It looks like this:

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])

score = 'roc_auc'

param = {
    'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as an example
}

gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)

GridSearchCV also applies cross-validation for you when you fit it:

gsearch.fit(X, y)

And you get the best params and the best score:

gsearch.best_params_, gsearch.best_score_
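A self-contained sketch of this grid-search approach, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBClassifier (so it runs without xgboost installed) and a binary dataset, since the 'roc_auc' scorer expects a binary target by default:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', GradientBoostingClassifier(random_state=0))])

# Note the double underscore: <step name>__<estimator parameter>
param_grid = {'classifier__max_depth': [1, 2, 3]}

gsearch = GridSearchCV(estimator=pipe, param_grid=param_grid,
                       scoring='roc_auc', cv=3)
gsearch.fit(X, y)  # runs the cross-validated search over the grid
print(gsearch.best_params_, gsearch.best_score_)
```

There is no need to call pipe.fit separately beforehand; GridSearchCV clones and fits the pipeline itself for every parameter combination and fold.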
