簡體   English   中英

sklearn管道中具有GridSearchCV,縮放,PCA和早期停止功能的XGBoost

[英]XGBoost with GridSearchCV, Scaling, PCA, and Early-Stopping in sklearn Pipeline

我想將XGBoost模型與輸入縮放和PCA減少的功能空間結合起來。 此外,應使用交叉驗證來調整模型的超參數以及PCA中使用的組件數量。 為防止模型過擬合,應添加早期停止功能。

為了結合各個步驟,我決定使用sklearn的Pipeline功能。

剛開始,我在確保PCA也應用於驗證集方面遇到一些問題。 但是我認為使用XGB__eval_set可以達成協議。

該代碼實際上在運行時沒有任何錯誤,但是似乎可以永遠運行(在某些時候,所有內核的CPU使用率均降至零,但進程繼續運行數小時;在某些時候不得不終止會話)。

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor   

# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X_with_features, y, test_size=0.2, random_state=123)

# Train / Validation split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123)

# Pipeline
pipe = Pipeline(steps=[("Scale", StandardScaler()),
                       ("PCA", PCA()),
                       ("XGB", XGBRegressor())])

# Hyper-parameter grid (Test only)
grid_param_pipe = {'PCA__n_components': [5],
                   'XGB__n_estimators': [1000],
                   'XGB__max_depth': [3],
                   'XGB__reg_alpha': [0.1],
                   'XGB__reg_lambda': [0.1]}

# Grid object
grid_search_pipe = GridSearchCV(estimator=pipe,
                                param_grid=grid_param_pipe,
                                scoring="neg_mean_squared_error",
                                cv=5,
                                n_jobs=5,
                                verbose=3)

# Run CV
grid_search_pipe.fit(X_train, y_train, XGB__early_stopping_rounds=10, XGB__eval_metric="rmse", XGB__eval_set=[[X_val, y_val]])

問題在於, fit方法需要在外部創建一個評估集,但是我們無法在管道進行轉換之前創建一個評估集。

這有點棘手,但是我們的想法是為xgboost回歸器/分類器創建一個薄包裝器,為內部的評估集做准備。

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor, XGBClassifier

class XGBoostWithEarlyStop(BaseEstimator):
    def __init__(self, early_stopping_rounds=5, test_size=0.1, 
                 eval_metric='mae', **estimator_params):
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.eval_metric=eval_metric='mae'        
        if self.estimator is not None:
            self.set_params(**estimator_params)

    def set_params(self, **params):
        return self.estimator.set_params(**params)

    def get_params(self, **params):
        return self.estimator.get_params()

    def fit(self, X, y):
        x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=self.test_size)
        self.estimator.fit(x_train, y_train, 
                           early_stopping_rounds=self.early_stopping_rounds, 
                           eval_metric=self.eval_metric, eval_set=[(x_val, y_val)])
        return self

    def predict(self, X):
        return self.estimator.predict(X)

class XGBoostRegressorWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBRegressor()
        super(XGBoostRegressorWithEarlyStop, self).__init__(*args, **kwargs)

class XGBoostClassifierWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBClassifier()
        super(XGBoostClassifierWithEarlyStop, self).__init__(*args, **kwargs)

下面是一個測試。

from sklearn.datasets import load_diabetes
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

x, y = load_diabetes(return_X_y=True)
print(x.shape, y.shape)
# (442, 10) (442,)

pipe = Pipeline([
    ('pca', PCA(5)),
    ('xgb', XGBoostRegressorWithEarlyStop())
])

param_grid = {
    'pca__n_components': [3, 5, 7],
    'xgb__n_estimators': [10, 20, 30, 50]
}

grid = GridSearchCV(pipe, param_grid, scoring='neg_mean_absolute_error')
grid.fit(x, y)
print(grid.best_params_)

如果向開發人員請求功能請求,最簡單的擴展是允許XGBRegressor在未提供時在內部創建評估集。 這樣,就不需要擴展scikit-learn了(我想)。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM