

Using Hyper-parameters from H2O to Re-build XGBoost in Sklearn Gives Different Performance in Python

After using the H2O Python module's AutoML, I found that XGBoost is at the top of the leaderboard. I then tried to extract the hyper-parameters from the H2O XGBoost model and replicate them with the XGBoost Sklearn API. However, the performance differs between these two approaches:


from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report

import xgboost as xgb
import scikitplot as skplt
import h2o
from h2o.automl import H2OAutoML
import numpy as np
import pandas as pd

h2o.init()


iris = datasets.load_iris()
X = iris.data
y = iris.target

data = pd.DataFrame(np.concatenate([X, y[:,None]], axis=1)) 
data.columns = iris.feature_names + ['target']
data = data.sample(frac=1)
# data.shape

train_df = data[:120]
test_df = data[120:]

# Import the train/test sets into H2O
train = h2o.H2OFrame(train_df)
test = h2o.H2OFrame(test_df)

# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)

# For classification, the response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
                keep_cross_validation_predictions=True,
                exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)

model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
# m.params.keys()
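
Before mapping anything, it can help to look at the values H2O actually used for the winning model. A minimal sketch, assuming the model m retrieved above (each entry in m.params stores the default and the actual value chosen):

# Sketch: print the hyper-parameter values H2O actually used for the leader XGBoost
for name, values in m.params.items():
    print(name, '->', values['actual'])
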
  1. Performance of H2O XGBoost:
skplt.metrics.plot_confusion_matrix(test_df['target'], 
                                    m.predict(test).as_data_frame()['predict'], 
                                    normalize=False)

[confusion matrix of the H2O XGBoost predictions on the test set]

  2. Replicate in the XGBoost Sklearn API:
mapping_dict = {
        "booster": "booster",
        "colsample_bylevel": "col_sample_rate",
        "colsample_bytree": "col_sample_rate_per_tree",
        "gamma": "min_split_improvement",
        "learning_rate": "learn_rate",
        "max_delta_step": "max_delta_step",
        "max_depth": "max_depth",
        "min_child_weight": "min_rows",
        "n_estimators": "ntrees",
        "nthread": "nthread",
        "reg_alpha": "reg_alpha",
        "reg_lambda": "reg_lambda",
        "subsample": "sample_rate",
        "seed": "seed",

        # "max_delta_step": "score_tree_interval",
        #  'missing': None,
        #  'objective': 'binary:logistic',
        #  'scale_pos_weight': 1,
        #  'silent': 1,
        #  'base_score': 0.5,
}

parameter_from_water = {}
for item in mapping_dict.items():
    parameter_from_water[item[0]] = m.params[item[1]]['actual']
# parameter_from_water

xgb_clf = xgb.XGBClassifier(**parameter_from_water)
xgb_clf.fit(train_df.drop('target', axis=1), train_df['target'])
  3. Performance of Sklearn XGBoost:
    (always worse than H2O in all examples I tried.)
skplt.metrics.plot_confusion_matrix(test_df['target'], 
                                    xgb_clf.predict(test_df.drop('target', axis=1)), 
                                    normalize=False)

[confusion matrix of the Sklearn XGBoost predictions on the test set]

Anything obvious that I missed?

When you use H2O AutoML with the following lines of code:

aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
                keep_cross_validation_predictions=True,
                exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)

you use the option nfolds = 3, which means each algorithm will be trained three times, using two thirds of the data for training and one third for validation. This makes the algorithm more stable, and it sometimes gives better performance than handing it your entire training dataset in one go.

This is what you do when you train your XGBoost using fit(). Even though you have the same algorithm (XGBoost) with the same hyper-parameters, you don't use the training set the same way H2O does. Hence the difference in your confusion matrices!
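
To make the two setups more comparable, you can also score the Sklearn model with 3-fold cross-validated predictions instead of a single fit on the training split. A rough sketch, reusing cross_val_predict and classification_report (already imported in the question) and the parameter_from_water dictionary built above; it illustrates the point but does not reproduce H2O's internal cross-validation exactly:

# Sketch: 3-fold cross-validated predictions for the Sklearn XGBoost,
# mirroring H2O's nfolds = 3 setup
xgb_cv_clf = xgb.XGBClassifier(**parameter_from_water)
oof_preds = cross_val_predict(xgb_cv_clf,
                              train_df.drop('target', axis=1),
                              train_df['target'],
                              cv=3)
print(classification_report(train_df['target'], oof_preds))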

If you want to have the same performance when copying the best model, you can change the parameter to H2OAutoML(..., nfolds = 0).
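
For completeness, the AutoML call from the question with cross-validation disabled might look like the sketch below; keep_cross_validation_predictions is dropped because it only applies when cross-validation is enabled:

# Sketch: same AutoML run as in the question, but without cross-validation
aml = H2OAutoML(max_models=10, seed=1, nfolds=0,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)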


Furthermore, H2O takes into account approximately 60 different parameters, and you missed a few important ones in your dictionary, like min_child_weight. So your XGBoost is not exactly the same as your H2O model, which could explain the differences in performance.
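
One way to see which H2O parameters the mapping dictionary does not cover is to compare its values against everything stored in m.params. A sketch, assuming m and mapping_dict from the question (the output also includes bookkeeping parameters such as the training frame, not only hyper-parameters):

# Sketch: H2O parameter names not mapped by mapping_dict, with the values H2O used
covered = set(mapping_dict.values())
unmapped = {name: values['actual']
            for name, values in m.params.items()
            if name not in covered}
print(unmapped)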


 