
How to evaluate the xgboost classification model stability

I have:

  1. A Python xgboost classification model.
  2. Weekly datasets (the basis of classification) since the beginning of 2018. Each dataset has about 100 thousand rows and 70 columns (features).
  3. Weekly prediction results on these datasets via the xgboost model (using logistic regression), in the format:
- date of modelling
- items
- test_auc_mean for each item (in percentage).

In total there are about 100 datasets and 100 prediction results since January 2018.

To assess the model I use such metrics as:

- AUC

- confusion matrix

- accuracy

import xgboost as xgb
from sklearn.metrics import confusion_matrix

# Hyperparameter values (num_parallel_tree, subsample, ...) are defined elsewhere.
param = {
    'num_parallel_tree': num_parallel_tree,
    'subsample': subsample,
    'colsample_bytree': colsample_bytree,
    'objective': objective,
    'learning_rate': learning_rate,
    'eval_metric': eval_metric,
    'max_depth': max_depth,
    'scale_pos_weight': scale_pos_weight,
    'min_child_weight': min_child_weight,
    'nthread': nthread,
    'seed': seed
}

# Cross-validate to find the best number of boosting rounds.
bst_cv = xgb.cv(
    param,
    dtrain,
    num_boost_round=n_estimators,
    nfold=nfold,
    early_stopping_rounds=early_stopping_rounds,
    verbose_eval=verbose,
    stratified=stratified
)

# Pick the boosting round with the highest mean test AUC.
test_auc_mean = bst_cv['test-auc-mean']
best_iteration = test_auc_mean[test_auc_mean == max(test_auc_mean)].index[0]

# Retrain with the chosen number of boosting rounds.
bst = xgb.train(param,
                dtrain,
                num_boost_round=best_iteration)

best_train_auc_mean = bst_cv['train-auc-mean'][best_iteration]
best_train_auc_mean_std = bst_cv['train-auc-std'][best_iteration]

best_test_auc_mean = bst_cv['test-auc-mean'][best_iteration]
best_test_auc_mean_std = bst_cv['test-auc-std'][best_iteration]

print('''XGB CV model report
Best train-auc-mean {}% (std: {}%) 
Best test-auc-mean {}% (std: {}%)'''.format(round(best_train_auc_mean * 100, 2), 
                                          round(best_train_auc_mean_std * 100, 2), 
                                          round(best_test_auc_mean * 100, 2), 
                                          round(best_test_auc_mean_std * 100, 2)))

# Predict probabilities on the test set and binarise them at a 0.9 threshold.
y_pred = bst.predict(dtest)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred > 0.9).ravel()


# Rows: correct / incorrect predictions; columns: predicted class.
print('''
     | neg | pos |
__________________
true_| {}  | {}  |
false| {}  | {}  |
__________________

'''.format(tn, tp, fn, fp))

predict_accuracy_on_test_set = (tn + tp)/(tn + fp + fn + tp)
print('Test Accuracy: {}%'.format(round(predict_accuracy_on_test_set * 100, 2)))

The model gives me the general picture (usually the AUC is between .94 and .96). The problem is that the variability of the predictions for some specific items is very high (today an item is positive, tomorrow it is negative, the day after tomorrow it is positive again).

I want to evaluate the model's stability. In other words, I want to know how many items with variable results it generates. In the end, I want to be sure that the model will generate stable results with minimal fluctuation. Do you have some thoughts on how to do this?
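For instance, a minimal sketch of what I mean by counting items with variable results, assuming the weekly prediction results are collected into a pandas DataFrame with hypothetical columns date, item and prediction:

import pandas as pd

# Hypothetical layout of the weekly prediction results:
# one row per (date, item) with the binary outcome of that week's run.
weekly = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-08', '2018-01-15',
             '2018-01-01', '2018-01-08', '2018-01-15'],
    'item': ['A', 'A', 'A', 'B', 'B', 'B'],
    'prediction': [1, 0, 1, 1, 1, 1],
})

weekly = weekly.sort_values(['item', 'date'])

# For each item, count how often the prediction flips between consecutive weeks.
flips = weekly.groupby('item')['prediction'].apply(
    lambda s: int((s != s.shift()).iloc[1:].sum()))
weeks = weekly.groupby('item')['prediction'].size() - 1
flip_rate = (flips / weeks).rename('flip_rate')

print(flip_rate)                                  # A: 1.0, B: 0.0
print('unstable items:', int((flip_rate > 0.5).sum()))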

That's precisely the goal of cross-validation. Since you already did it, you can only evaluate the standard deviation of your evaluation metrics, which you have already done as well...

  1. You can try some new metrics, like precision, recall, F1 score or F-beta score, to weight success and failure differently, but it looks like you're almost out of solutions. You're dependent on your data input here (see the first sketch after this list).

  2. You could spend some time on the training population distribution, and try to identify which part of the population fluctuates over time (second sketch below).

  3. You could also try to predict probabilities rather than classes, to evaluate whether the model is far above its threshold or not (third sketch below).
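For point 1, a minimal sketch, reusing y_test and y_pred from the question's code (the 0.9 cut-off is the one used there):

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_pred_label = (y_pred > 0.9).astype(int)   # same threshold as in the question

print('precision: {:.3f}'.format(precision_score(y_test, y_pred_label)))
print('recall:    {:.3f}'.format(recall_score(y_test, y_pred_label)))
print('f1:        {:.3f}'.format(f1_score(y_test, y_pred_label)))
# beta > 1 weights recall (missed positives) more heavily than precision
print('f2:        {:.3f}'.format(fbeta_score(y_test, y_pred_label, beta=2)))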
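For point 2, one possible (not the only) way is to compare each feature's distribution between two weekly datasets, for example with a two-sample Kolmogorov-Smirnov test; features that drift strongly point to the part of the population that fluctuates. A sketch, assuming two hypothetical weekly DataFrames:

import pandas as pd
from scipy.stats import ks_2samp

def feature_drift(week_a: pd.DataFrame, week_b: pd.DataFrame) -> pd.Series:
    """KS statistic per shared numeric feature; larger values mean stronger drift."""
    shared = week_a.columns.intersection(week_b.columns)
    stats = {c: ks_2samp(week_a[c].dropna(), week_b[c].dropna()).statistic
             for c in shared if pd.api.types.is_numeric_dtype(week_a[c])}
    return pd.Series(stats).sort_values(ascending=False)

# drift = feature_drift(dataset_week_1, dataset_week_2)   # hypothetical weekly frames
# print(drift.head(10))                                    # 10 most drifting features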
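For point 3, bst.predict with a logistic objective already returns probabilities, so you can check how far each item's score sits from the decision threshold; items close to the threshold are the ones most likely to flip from week to week. A sketch, again reusing y_pred and the 0.9 threshold from the question (the 0.05 band is an arbitrary choice):

import numpy as np

threshold = 0.9
margin = np.abs(y_pred - threshold)       # distance of each score from the cut-off

band = 0.05                               # arbitrary "too close to call" band
borderline = margin < band
print('{} of {} items are within {} of the threshold'.format(
    int(borderline.sum()), len(y_pred), band))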

The last two solutions are more like side solutions. :(

[Attached figure: predicted probability of 1 item (mean-auc ...)]

Gwendal, thank you. Would you specify the 2 approaches you mentioned? 1) How can I train the population distribution? Via K-Clustering or other methods of unsupervised learning? 2) E.g. I used predict_proba (the diagram for 1 specific item is in the attachment). How can I evaluate whether the model is far above its threshold? Via comparison of the predict_proba of each item with its true label (e.g. predict_proba = 0.5 and label = 1)?
