I have a dataset of around 1M rows with a severe class imbalance (743 / 1072780). I am training an XGBoost model in H2O with the following parameters, and it looks like it is overfitting:
H2OXGBoostEstimator(
    max_depth=10,
    subsample=0.7,
    ntrees=200,
    learn_rate=0.5,
    min_rows=3,
    col_sample_rate_per_tree=0.75,
    reg_lambda=2.0,
    reg_alpha=2.0,
    sample_rate=0.5,
    booster='gbtree',
    nfolds=10,
    keep_cross_validation_predictions=True,
    stopping_metric='AUCPR',
    min_split_improvement=1e-5,
    categorical_encoding='OneHotExplicit',
    weights_column="Products"
)
The output is:
Training data AUCPR: 0.6878932664592388 Validation data AUCPR: 0.04033158660014747
Training data AUC: 0.9992170372214433 Validation data AUC: 0.7000804189162043
Training data MSE: 0.0005722912424124134 Validation data MSE: 0.0010002949568585474
Training data RMSE: 0.023922609439866994 Validation data RMSE: 0.03162743993526108
Training data Gini: 0.9984340744428866 Validation data Gini: 0.40016083783240863
Confusion Matrix for Training Data:
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15900755567210062:
       0       1    Error   Rate
-----  ------  ---  ------  ----------------
0      709201  337  0.0005  (337.0/709538.0)
1      189     516  0.2681  (189.0/705.0)
Total  709390  853  0.0007  (526.0/710243.0)
Confusion Matrix for Validation Data:
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.047459165255228676:
       0       1    Error   Rate
-----  ------  ---  ------  ----------------
0      202084  365  0.0018  (365.0/202449.0)
1      140     52   0.7292  (140.0/192.0)
Total  202224  417  0.0025  (505.0/202641.0)
I am using H2O version 3.32.0.1 (this is a requirement), and XGBoost in H2O doesn't support the balance_classes or scale_pos_weight hyperparameters.
What could cause such poor performance on the validation data? Also, what could be changed to improve performance on such an imbalanced dataset?
Training on such a severely imbalanced dataset is pointless. I would try a combination of upsampling and downsampling to get a more balanced dataset that does not become too small.
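A minimal sketch of combining upsampling and downsampling, in plain Python rather than H2O. The helper name `rebalance`, the parameters `minority_share` and `upsample_factor`, and the toy data are all hypothetical and chosen for illustration only. It assumes binary labels (1 = minority, 0 = majority) and should be applied to the training split only, so that validation data keeps the original class distribution.

```python
import random


def rebalance(rows, label_of, minority_share=0.15, upsample_factor=3, seed=42):
    """Return a shuffled training set where the minority class makes up
    roughly `minority_share` of the rows.

    Minority rows are repeated `upsample_factor` times (upsampling) and
    majority rows are randomly sampled without replacement (downsampling).
    All parameter names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == 1]
    majority = [r for r in rows if label_of(r) == 0]

    # Upsample: simple duplication of every minority row.
    upsampled = minority * upsample_factor

    # Downsample: keep just enough majority rows to hit the target share,
    # never more than are actually available.
    n_majority = round(len(upsampled) * (1 - minority_share) / minority_share)
    n_majority = min(n_majority, len(majority))
    downsampled = rng.sample(majority, n_majority)

    balanced = upsampled + downsampled
    rng.shuffle(balanced)
    return balanced


# Toy data: 70 positives among 100,000 rows, each row is (id, label).
rows = [(i, 1 if i < 70 else 0) for i in range(100_000)]
balanced = rebalance(rows, label_of=lambda r: r[1])
n_pos = sum(label for _, label in balanced)
print(len(balanced), n_pos, n_pos / len(balanced))
```

Duplicating minority rows is the crudest form of upsampling and can itself encourage overfitting to those rows, so the factor is worth keeping small and checking against untouched validation data.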
This may be the worst class imbalance I have ever seen in a problem.
If you can subset your majority class, not to the point that the classes are balanced but until the imbalance is less severe while the data is still representative (e.g., 15%/85% minority/majority), you'll have more luck with other conventional techniques, or a mixture of them (e.g., upsampling and augmentation). Can the data logically be subset to help with the imbalance? For example, if the data ranges back several years, you could use only the last year's worth. I'd also manually tune the classification threshold against a minority-class metric, such as the true positive rate.
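One way to tune the threshold against the minority class is to pick the highest cutoff that still reaches a chosen true positive rate on held-out data. The sketch below is plain Python; the function names, the target of 0.8, and the toy scores are assumptions for illustration. It expects `scores` to be predicted probabilities of the positive class.

```python
import math


def threshold_for_recall(y_true, scores, target_recall=0.8):
    """Highest threshold at which recall (true positive rate) on the
    positive class is at least `target_recall`."""
    positives = sorted((s for y, s in zip(y_true, scores) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive examples to tune against")
    # Number of positives that must score at or above the threshold.
    k = max(1, math.ceil(target_recall * len(positives)))
    return positives[k - 1]


def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when predicting 1 for scores >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 0 and predicted == 1:
            fp += 1
        elif y == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy held-out data: 5 positives followed by 10 negatives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.05,
          0.5, 0.3, 0.1, 0.05, 0.02, 0.01, 0.15, 0.08, 0.03, 0.25]

threshold = threshold_for_recall(y_true, scores, target_recall=0.8)
tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
print(threshold, tp, fp, fn, tn)
```

The cost of a higher true positive rate shows up as extra false positives, so the target is best chosen with the relative cost of the two error types in mind, and the threshold should be selected on data the model was not trained on.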