
Tensorflow decision forest custom metric vs. number of trees

I have created a classification model using TensorFlow Decision Forests. I'm struggling to evaluate how performance changes vs. the number of trees for a non-default metric (in this case PR-AUC).

Below is some code with my attempts.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Turn the diabetes regression target into a binary label
train = load_diabetes()
X = pd.DataFrame(train['data'])
X['target'] = (pd.Series(train['target']) > 100).astype(int)
X_train, X_test = train_test_split(X)

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X_train, label="target")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X_test, label="target")

# Compile a gradient boosted trees model with PR-AUC as an extra metric
pr_auc = tf.keras.metrics.AUC(curve='PR')
tfdf_clf = tfdf.keras.GradientBoostedTreesModel()
tfdf_clf.compile(metrics=[pr_auc])
tfdf_clf.fit(train_ds, validation_data=test_ds)

Now I get very useful training logs using

tfdf_clf.make_inspector().training_logs()
#[TrainLog(num_trees=1, evaluation=Evaluation(num_examples=None, accuracy=0.9005518555641174, loss=0.6005926132202148, rmse=None, ndcg=None, aucs=None)),
#TrainLog(num_trees=2, evaluation=Evaluation(num_examples=None, accuracy=0.9005518555641174, loss=0.5672071576118469, rmse=None, ndcg=None, aucs=None)),
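
For the metrics that do appear in these logs, I can plot the curve vs. number of trees directly. A minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

logs = tfdf_clf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs],
         [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (internal validation)")
plt.show()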

But these logs don't contain any information on PR-AUC vs. iterations.

If I evaluate the model, it only reports the PR-AUC at the end of training, although it seems to log some intermediate info.

tfdf_clf.evaluate(test_ds)

1180/1180 [==============================] - 10s 8ms/step - loss: 0.0000e+00 - auc: 0.6832

How can I find how the test-data PR-AUC changes vs. the number of trees? I specifically need to use the tensorflow_decision_forests library.

The PR-AUC metric is not supported during training for Gradient Boosted Trees. However, all of the metrics are available for Random Forest. You'd need to put your training data into the same format as the test data, train a model on train_ds, and evaluate it against test_ds.
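
A minimal sketch of that Random Forest route, reusing the train_ds and test_ds from the question (the name 'pr_auc' is just a label chosen here):

# Train a Random Forest on the same data and evaluate it with PR-AUC.
rf_clf = tfdf.keras.RandomForestModel()
rf_clf.compile(metrics=[tf.keras.metrics.AUC(curve='PR', name='pr_auc')])
rf_clf.fit(train_ds)
print(rf_clf.evaluate(test_ds, return_dict=True))  # dict includes 'pr_auc'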

The reason Gradient Boosted Trees don't have the PR-AUC metric is that they are trained in a different way than Random Forest. They are not regressors, so it wouldn't make sense to return a probability estimate of being positive. Instead, they return just an average class-label prediction across all the trees for each test example, together with a ranking of labels. These rankings are used to calculate the aggregated metrics via the AggregatedMetrics API. Note that it averages all predictions across all trees during training, so there is no parameter to control how many samples are used for evaluation purposes.

A better way to evaluate these kinds of models is not with a human-chosen metric like PR-AUC, but instead with the auto metrics built into TensorFlow. This is because they account for model size (smaller models can sometimes be statistically significant but end up overfitting heavily due to their small size), and also allow you to choose how many samples are used in the evaluation (which can differ from the training set).

Plot the AUPRC: the area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it's calculated, PR-AUC may be equivalent to the average precision of the model. It may look like precision is relatively high, while recall and the area under the ROC curve (AUC) aren't as high as you might like. Classifiers often face challenges when trying to maximize both precision and recall, which is especially true when working with imbalanced datasets. It is important to consider the costs of different types of errors in the context of the problem you care about. In this example, a false negative (a fraudulent transaction is missed) may have a financial cost, while a false positive (a transaction is incorrectly flagged as fraudulent) may decrease user happiness.
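
As a rough sketch of such a plot for the model from the question, assuming scikit-learn and matplotlib are available (labels are read back out of test_ds, and tfdf_clf.predict returns the positive-class probability for a binary model):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve

# True labels and predicted probabilities on the test data
y_true = np.concatenate([y.numpy() for _, y in test_ds])
y_prob = tfdf_clf.predict(test_ds).flatten()

precision, recall, _ = precision_recall_curve(y_true, y_prob)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"AUPRC = {auc(recall, precision):.3f}")
plt.show()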

In general, the more trees you use, the better the results. However, the improvement diminishes as the number of trees increases, i.e. at a certain point the benefit in prediction performance from learning more trees will be lower than the cost in computation time of learning these additional trees.

Random forests are ensemble methods, and you average over many trees. Similarly, if you want to estimate the average of a real-valued random variable (e.g. the average height of a citizen in your country), you can take a sample. The expected variance will decrease with the square root of the sample size, and at a certain point the cost of collecting a larger sample will be higher than the benefit in accuracy obtained from that larger sample.

In your case, you observe that in a single experiment on a single test set, a forest of 10 trees performs better than a forest of 500 trees. This may be due to statistical variance. If this happened systematically, I would hypothesize that there is something wrong with the implementation. Typical values for the number of trees are 10, 30 or 100. I think in only very few practical cases does the benefit of more than 300 trees outweigh the cost of learning them (well, except maybe if you have a really huge dataset).
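
Coming back to the original question, a brute-force way to see how the test-set PR-AUC changes with the number of trees in TF-DF is to sweep the num_trees hyperparameter, training one model per value and evaluating each on test_ds. A sketch under those assumptions (slow, since every point retrains a model; early_stopping="NONE" keeps a Gradient Boosted Trees model from stopping before n trees):

pr_auc_by_trees = {}
for n in [1, 5, 10, 50, 100, 300]:
    model = tfdf.keras.GradientBoostedTreesModel(num_trees=n, early_stopping="NONE")
    model.compile(metrics=[tf.keras.metrics.AUC(curve='PR', name='pr_auc')])
    model.fit(train_ds)
    pr_auc_by_trees[n] = model.evaluate(test_ds, return_dict=True)['pr_auc']
print(pr_auc_by_trees)

The same loop works with tfdf.keras.RandomForestModel (which also takes num_trees) if you follow the Random Forest suggestion above.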
