Using hyper-parameters from H2O to rebuild XGBoost in sklearn gives different performance in Python
After running AutoML from the H2O Python module, I found that XGBoost is at the top of the leaderboard. I then tried to extract the hyper-parameters from the H2O XGBoost model and replicate it with the XGBoost sklearn API. However, the two approaches perform differently:
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report
import xgboost as xgb
import scikitplot as skplt
import h2o
from h2o.automl import H2OAutoML
import numpy as np
import pandas as pd
h2o.init()
iris = datasets.load_iris()
X = iris.data
y = iris.target
data = pd.DataFrame(np.concatenate([X, y[:,None]], axis=1))
data.columns = iris.feature_names + ['target']
data = data.sample(frac=1)
# data.shape
train_df = data[:120]
test_df = data[120:]
# Convert the pandas train/test sets into H2OFrames
train = h2o.H2OFrame(train_df)
test = h2o.H2OFrame(test_df)
# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)
# For classification, the response must be a factor (categorical)
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
aml = H2OAutoML(max_models=10, seed=1, nfolds=3,
                keep_cross_validation_predictions=True,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
# m.params.keys()
skplt.metrics.plot_confusion_matrix(test_df['target'],
                                    m.predict(test).as_data_frame()['predict'],
                                    normalize=False)
# Keys are XGBoost sklearn parameter names, values are H2O parameter names
mapping_dict = {
    "booster": "booster",
    "colsample_bylevel": "col_sample_rate",
    "colsample_bytree": "col_sample_rate_per_tree",
    "gamma": "min_split_improvement",
    "learning_rate": "learn_rate",
    "max_delta_step": "max_delta_step",
    "max_depth": "max_depth",
    "min_child_weight": "min_rows",
    "n_estimators": "ntrees",
    "nthread": "nthread",
    "reg_alpha": "reg_alpha",
    "reg_lambda": "reg_lambda",
    "subsample": "sample_rate",
    "seed": "seed",
    # "max_delta_step": "score_tree_interval",
    # 'missing': None,
    # 'objective': 'binary:logistic',
    # 'scale_pos_weight': 1,
    # 'silent': 1,
    # 'base_score': 0.5,
}
parameter_from_water = {}
for xgb_name, h2o_name in mapping_dict.items():
    # Copy the value H2O actually used for each mapped parameter
    parameter_from_water[xgb_name] = m.params[h2o_name]['actual']
# parameter_from_water
xgb_clf = xgb.XGBClassifier(**parameter_from_water)
xgb_clf.fit(train_df.drop('target', axis=1), train_df['target'])
skplt.metrics.plot_confusion_matrix(test_df['target'],
                                    xgb_clf.predict(test_df.drop('target', axis=1)),
                                    normalize=False);
Is there anything obvious that I missed?
When you run H2O AutoML with the following lines of code:
aml = H2OAutoML(max_models=10, seed=1, nfolds=3,
                keep_cross_validation_predictions=True,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
you set the option nfolds = 3, which means each algorithm is trained three times, each time using two thirds of the data for training and one third for validation. This makes the algorithm more stable, and it sometimes performs better than when it is given the entire training dataset in one go.
Giving the entire training set in one go is exactly what happens when you train your XGBoost using fit(). Even though you have the same algorithm (XGBoost) with the same hyperparameters, you don't use the training set the same way H2O does. Hence the difference in your confusion matrices!
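To make the fold arithmetic concrete, here is a minimal pure-Python sketch of a 3-fold split over the 120 training rows from the question. The helper name kfold_indices and the modulo-based fold assignment are assumptions for illustration only; H2O's own fold assignment may work differently.

```python
def kfold_indices(n_rows, nfolds):
    """Yield (train_idx, valid_idx) pairs for a simple modulo-based k-fold split.

    Illustration only: row i is placed in validation fold i % nfolds.
    """
    for fold in range(nfolds):
        valid_idx = [i for i in range(n_rows) if i % nfolds == fold]
        train_idx = [i for i in range(n_rows) if i % nfolds != fold]
        yield train_idx, valid_idx


# 120 training rows and nfolds = 3, as in the question.
folds = list(kfold_indices(120, 3))
fold_sizes = [(len(train_idx), len(valid_idx)) for train_idx, valid_idx in folds]
print(fold_sizes)  # [(80, 40), (80, 40), (80, 40)]
```

Each fold trains on 80 rows and validates on the remaining 40, whereas a single call to fit() sees all 120 rows at once.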
If you want the same performance when copying the best model, you can change the parameter to H2OAutoML(..., nfolds = 0).
Furthermore, H2O's XGBoost takes approximately 60 different parameters into account, and your dictionary covers only a subset of them. So your xgboost model is not exactly the same as your H2O model, which could also explain the differences in performance.
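One way to see which parameters a mapping leaves out is sketched below. It assumes the parameter dictionary has the shape the question indexes into, that is {name: {'default': ..., 'actual': ...}}. The helper translate_params and all example values are hypothetical and are not taken from a real H2O model.

```python
def translate_params(h2o_params, mapping):
    """Translate H2O-style parameters into sklearn-style keyword arguments.

    h2o_params: {h2o_name: {"default": ..., "actual": ...}}
    mapping:    {xgboost_sklearn_name: h2o_name}
    Returns (kwargs, unmapped), where unmapped lists the H2O parameters that
    are absent from the mapping and whose actual value differs from the default.
    """
    kwargs = {xgb_name: h2o_params[h2o_name]["actual"]
              for xgb_name, h2o_name in mapping.items()
              if h2o_name in h2o_params}
    mapped_h2o_names = set(mapping.values())
    unmapped = sorted(name for name, value in h2o_params.items()
                      if name not in mapped_h2o_names
                      and value["actual"] != value["default"])
    return kwargs, unmapped


# Hypothetical values for illustration, NOT read from a real model.
example_params = {
    "ntrees": {"default": 50, "actual": 35},
    "learn_rate": {"default": 0.3, "actual": 0.05},
    "max_depth": {"default": 6, "actual": 6},
    "tree_method": {"default": "auto", "actual": "exact"},
    "grow_policy": {"default": "depthwise", "actual": "depthwise"},
}
example_mapping = {
    "n_estimators": "ntrees",
    "learning_rate": "learn_rate",
    "max_depth": "max_depth",
}

kwargs, unmapped = translate_params(example_params, example_mapping)
print(kwargs)    # {'n_estimators': 35, 'learning_rate': 0.05, 'max_depth': 6}
print(unmapped)  # ['tree_method']
```

With a real model, passing m.params and mapping_dict from the question would list the non-default H2O settings that never reach XGBClassifier. Some of those entries are bookkeeping fields rather than tuning parameters, so the list still needs to be read by hand.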