
Tuning XGBoost Hyperparameters with RandomizedSearchCV

I'm trying to use XGBoost for a particular dataset that contains around 500,000 observations and 10 features. I'm trying to do some hyperparameter tuning with RandomizedSearchCV, and the performance of the model with the best parameters is worse than that of the model with the default parameters.

Model with default parameters:

from xgboost import XGBRegressor

model = XGBRegressor()
model.fit(X_train, y_train["speed"])
y_predict_speed = model.predict(X_test)

from sklearn.metrics import r2_score
print("R2 score:", r2_score(y_test["speed"],y_predict_speed, multioutput='variance_weighted'))
R2 score: 0.3540656307310167

Best model from random search:

from sklearn.model_selection import RandomizedSearchCV

## Hyper Parameter Optimization
booster = ['gbtree', 'gblinear']
base_score = [0.25, 0.5, 0.75, 1]
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
learning_rate = [0.05, 0.1, 0.15, 0.20]
min_child_weight = [1, 2, 3, 4]

# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'learning_rate': learning_rate,
    'min_child_weight': min_child_weight,
    'booster': booster,
    'base_score': base_score
    }

# Set up the random search with 5-fold cross validation
regressor = XGBRegressor()
random_cv = RandomizedSearchCV(estimator=regressor,
            param_distributions=hyperparameter_grid,
            cv=5, n_iter=50,
            scoring='neg_mean_absolute_error', n_jobs=4,
            verbose=5,
            return_train_score=True,
            random_state=42)

random_cv.fit(X_train,y_train["speed"])

random_cv.best_estimator_

XGBRegressor(base_score=0.5, booster='gblinear', colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=-1, importance_type='gain', interaction_constraints=None,
             learning_rate=0.15, max_delta_step=None, max_depth=15,
             min_child_weight=3, missing=nan, monotone_constraints=None,
             n_estimators=500, n_jobs=16, num_parallel_tree=None,
             random_state=0, reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
             subsample=None, tree_method=None, validate_parameters=1,
             verbosity=None)

Using the best model:

regressor = XGBRegressor(base_score=0.5, booster='gblinear', colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=-1, importance_type='gain', interaction_constraints=None,
             learning_rate=0.15, max_delta_step=None, max_depth=15,
             min_child_weight=3, monotone_constraints=None,
             n_estimators=500, n_jobs=16, num_parallel_tree=None,
             random_state=0, reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
             subsample=None, tree_method=None, validate_parameters=1,
             verbosity=None)

regressor.fit(X_train,y_train["speed"])
y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
print("R2 score:", r2_score(y_test["speed"],y_pred, multioutput='variance_weighted'))

R2 score: 0.14258774171629718

As you can see, after 3 hours of running the randomized search the R2 score actually drops. If I change the booster from 'gblinear' to 'gbtree' the value goes up to 0.65, so why is the randomized search not working?

I'm also getting the following warning:

This may not be accurate due to some parameters are only used in language bindings but passed down to XGBoost core. Or some parameters are not used but slip through this verification. Please open an issue if you find above cases.

Does anyone have a suggestion regarding this hyperparameter tuning method?

As stated in the XGBoost Docs:

Parameter tuning is a dark art in machine learning, the optimal parameters of a model can depend on many scenarios.

You asked for suggestions for your specific scenario, so here are some of mine.

  1. Drop the dimension booster from your hyperparameter search space. You probably want to go with the default booster 'gbtree'. If you are interested in the performance of a linear model you could just try linear or ridge regression, but don't bother with it during your XGBoost parameter tuning.
  2. Drop the dimension base_score from your hyperparameter search space. This should not have much of an effect with sufficiently many boosting iterations (see the XGB parameter docs).
  3. Currently you have 3200 hyperparameter combinations in your grid. Expecting to find a good one by looking at 50 random ones might be a bit too optimistic. After dropping the booster and base_score dimensions you would be down to
hyperparameter_grid = {
    'n_estimators': [100, 500, 900, 1100, 1500],
    'max_depth': [2, 3, 5, 10, 15],
    'learning_rate': [0.05, 0.1, 0.15, 0.20],
    'min_child_weight': [1, 2, 3, 4]
    }

which has 400 possible combinations. For a first shot I would simplify this a bit more. For example you could try something like

hyperparameter_grid = {
    'n_estimators': [100, 400, 800],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.05, 0.1, 0.20],
    'min_child_weight': [1, 10, 100]
    }

There are only 81 combinations left and some of the very expensive combinations (e.g. 1500 trees of depth 15) are removed. Of course I don't know your data, so maybe it is necessary to consider such large / complex models. For a regression task with squared loss, min_child_weight is just the number of instances in a child (again see the XGB parameter docs). Since you have 500,000 observations, it will probably not make (much of) a difference whether 1, 2, 3 or 4 observations end up in a leaf. Hence, I am suggesting [1, 10, 100] here. Maybe the random search finds something better than the default parameters in this grid?
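
For illustration, a minimal sketch of plugging this reduced grid into the same random search, assuming the X_train / y_train objects from the question (n_iter=30 and the other settings are just example choices, not recommendations):

from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

# Reduced grid: booster and base_score are no longer searched over
hyperparameter_grid = {
    'n_estimators': [100, 400, 800],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.05, 0.1, 0.20],
    'min_child_weight': [1, 10, 100]
    }

random_cv = RandomizedSearchCV(estimator=XGBRegressor(),   # default booster 'gbtree'
            param_distributions=hyperparameter_grid,
            cv=5, n_iter=30,                                # samples 30 of the 81 combinations
            scoring='neg_mean_absolute_error', n_jobs=4,
            random_state=42)

random_cv.fit(X_train, y_train["speed"])
print(random_cv.best_params_, random_cv.best_score_)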

  4. An alternative strategy could be: run cross validation for each combination of
hyperparameter_grid = {
    'max_depth': [3, 6, 9],
    'min_child_weight': [1, 10, 100]
    }

fixing the learning rate at some constant value (not too low, e.g. 0.15). For each setting use early stopping to determine an appropriate number of trees. This is possible using the early_stopping_rounds parameter of the xgboost.cv method. Afterwards you know a good combination of max_depth and min_child_weight (e.g. how complex do the base learners need to be for the given problem?) and also a good number of trees for this combination and the fixed learning rate. Fine tuning could then involve doing another hyperparameter search "close to" the current (max_depth, min_child_weight) solution and/or reducing the learning rate while increasing the number of trees.
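
A rough sketch of this loop with xgboost's native cv function, again assuming the X_train / y_train from the question (the cap of 2000 boosting rounds and the patience of 20 rounds are illustrative values, not part of the original suggestion):

import itertools
import xgboost as xgb

# Native DMatrix for xgboost.cv
dtrain = xgb.DMatrix(X_train, label=y_train["speed"])

results = {}
for max_depth, min_child_weight in itertools.product([3, 6, 9], [1, 10, 100]):
    params = {'objective': 'reg:squarederror',
              'eta': 0.15,                      # fixed learning rate
              'max_depth': max_depth,
              'min_child_weight': min_child_weight}
    cv = xgb.cv(params, dtrain,
                num_boost_round=2000,           # upper bound; early stopping picks fewer
                nfold=5, metrics='mae',
                early_stopping_rounds=20,
                seed=42)
    # with early stopping, xgb.cv truncates the result at the best iteration
    results[(max_depth, min_child_weight)] = (len(cv), cv['test-mae-mean'].iloc[-1])

best = min(results, key=lambda k: results[k][1])
print("best (max_depth, min_child_weight):", best, "->", results[best])

len(cv) for the winning combination then gives a reasonable number of trees to use as n_estimators at that learning rate.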

  5. And lastly, as this answer is getting a bit long, there are other alternatives to a random search if an exhaustive grid search is too expensive. E.g. you could look at halving grid search and sequential model based optimization.
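
For the halving option, a hedged sketch with scikit-learn's HalvingGridSearchCV (still experimental, hence the extra enabling import; the grid simply reuses the reduced one from point 3):

from sklearn.experimental import enable_halving_search_cv  # noqa: F401, enables the class below
from sklearn.model_selection import HalvingGridSearchCV
from xgboost import XGBRegressor

param_grid = {
    'n_estimators': [100, 400, 800],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.05, 0.1, 0.20],
    'min_child_weight': [1, 10, 100]
    }

halving_cv = HalvingGridSearchCV(XGBRegressor(), param_grid,
             factor=3,                  # keep roughly the best third of candidates each round
             resource='n_samples',      # start on a subsample, give survivors more data
             scoring='neg_mean_absolute_error',
             cv=5, random_state=42, n_jobs=4)

halving_cv.fit(X_train, y_train["speed"])
print(halving_cv.best_params_)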
