
Terrible performance with h2o xgboost imbalanced data

I have a dataset of around 1M rows with a severe class imbalance (743 positives / 1,072,780 negatives). I am training an XGBoost model in h2o with the following parameters, and it looks like it is overfitting:

H2OXGBoostEstimator(max_depth=10,
                    subsample=0.7,
                    ntrees=200,
                    learn_rate=0.5,
                    min_rows=3,
                    col_sample_rate_per_tree=0.75,
                    reg_lambda=2.0,
                    reg_alpha=2.0,
                    sample_rate=0.5,
                    booster='gbtree',
                    nfolds=10,
                    keep_cross_validation_predictions=True,
                    stopping_metric='AUCPR',
                    min_split_improvement=1e-5,
                    categorical_encoding='OneHotExplicit',
                    weights_column="Products")

The output is:

Training data AUCPR: 0.6878932664592388       Validation data AUCPR: 0.04033158660014747
Training data AUC: 0.9992170372214433           Validation data AUC: 0.7000804189162043
Training data MSE: 0.0005722912424124134           Validation data MSE: 0.0010002949568585474
Training data RMSE: 0.023922609439866994         Validation data RMSE: 0.03162743993526108
Training data Gini: 0.9984340744428866         Validation data Gini: 0.40016083783240863
Confusion Matrix for Training Data:
 
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15900755567210062: 
       0       1    Error    Rate
-----  ------  ---  -------  ----------------
0      709201  337  0.0005   (337.0/709538.0)
1      189     516  0.2681   (189.0/705.0)
Total  709390  853  0.0007   (526.0/710243.0)

Confusion Matrix for Validation Data:

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.047459165255228676: 
       0       1    Error    Rate
-----  ------  ---  -------  ----------------
0      202084  365  0.0018   (365.0/202449.0)
1      140     52   0.7292   (140.0/192.0)
Total  202224  417  0.0025   (505.0/202641.0)

I am using h2o version 3.32.0.1 (it's a requirement); h2o's XGBoost doesn't support the balance_classes or scale_pos_weight hyperparameters.
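Since scale_pos_weight isn't exposed, one common workaround is to emulate it through the weights_column parameter you are already passing: give every minority-class row a weight equal to the negative/positive ratio. A minimal sketch in plain numpy (the helper name make_class_weights is my own; you would attach the resulting array as a column of your H2OFrame and point weights_column at it):

```python
import numpy as np

def make_class_weights(y, pos_label=1):
    """Per-row weights that emulate XGBoost's scale_pos_weight:
    each positive row is weighted by (#negatives / #positives),
    each negative row by 1.0."""
    y = np.asarray(y)
    n_pos = (y == pos_label).sum()
    n_neg = len(y) - n_pos
    ratio = n_neg / n_pos
    return np.where(y == pos_label, ratio, 1.0)

# With the question's class counts (743 positives, 1072780 negatives):
y = np.array([1] * 743 + [0] * 1072780)
w = make_class_weights(y)
# minority rows get weight 1072780/743 (~1444), majority rows get 1.0
```

Note that the question currently sets weights_column to "Products", which looks like a feature column rather than a weight column; if so, that alone can distort training.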

What can cause such performance? Also, what can be improved here, given such an imbalanced dataset, that might raise the validation metrics?

Training with such a severely imbalanced data set is pointless. I would try a combination of up-sampling and down-sampling to get a more balanced data set that does not get too small.
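A minimal sketch of that combined up-/down-sampling in plain numpy (the factors `up_factor` and `neg_per_pos` are illustrative choices, not values from the answer; you would index your training frame with the returned positions):

```python
import numpy as np

rng = np.random.default_rng(42)

def resample_indices(y, pos_label=1, up_factor=5, neg_per_pos=4):
    """Up-sample the minority class with replacement and
    down-sample the majority class, returning row indices."""
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == pos_label)
    neg_idx = np.flatnonzero(y != pos_label)
    # replicate positives up_factor times (sampling with replacement)
    pos_up = rng.choice(pos_idx, size=len(pos_idx) * up_factor, replace=True)
    # keep only neg_per_pos negatives per (up-sampled) positive
    n_neg = min(len(neg_idx), len(pos_up) * neg_per_pos)
    neg_down = rng.choice(neg_idx, size=n_neg, replace=False)
    idx = np.concatenate([pos_up, neg_down])
    rng.shuffle(idx)
    return idx

# With the question's class counts: 743*5 = 3715 positives
# against 3715*4 = 14860 negatives, i.e. a 20/80 split.
y = np.array([1] * 743 + [0] * 1072780)
idx = resample_indices(y)
```

If you up-sample, do it only on the training folds, never on the validation data, or the validation metrics become meaningless.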

This may be the worst class imbalance I have ever seen in a problem.

If you can subset your majority class, not until the data are balanced but until the imbalance is less severe while still being representative (e.g., a 15/85% minority/majority split), you'll have more luck with conventional techniques, or a mixture of them (e.g., up-sampling plus augmentation). Can the data logically be subset to help with the imbalance? For example, if the data range back several years, you could use only the last year's worth. I'd also manually optimize the decision threshold against a minority-class metric, such as the true positive rate.
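Manually optimizing the threshold can be sketched as a sweep over candidate cutoffs, scoring each with a recall-weighted metric such as F2 instead of the F1 that h2o's confusion matrices maximize by default (pure numpy; `pick_threshold` and the toy data are illustrative, not from h2o):

```python
import numpy as np

def pick_threshold(y_true, scores, beta=2.0):
    """Sweep every distinct score as a threshold and return the one
    maximizing F-beta; beta > 1 favors recall on the minority class."""
    y_true = np.asarray(y_true)
    best_t, best_f = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f = (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy example: the F2-optimal cutoff accepts all three high-scoring rows.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.9, 0.6, 0.8])
t, f = pick_threshold(y_true, scores)  # t == 0.6 here
```

In h2o you would run this sweep over the cross-validation predicted probabilities, then apply the chosen cutoff at scoring time instead of the max-F1 threshold.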
