

Different results for tensorflow evaluate and predict (F1-Score)

I am using tf 2.5 to evaluate a multiclass classification problem. I am using the F1 score since my dataset is highly imbalanced. The F1 metric I am using is from the tensorflow-addons package. When I use it with a binary model everything works fine, but results and training get weird when I am doing multiclass models.

During training and evaluation of the multiclass problem, the F1 score is way higher than it should be. In order to check whether the score was correct I used scikit-learn's F1 score metric, and it gave a much more reasonable result. Interestingly, when manually evaluating the prediction with the tfa F1 metric using update_state(), the score is the same as scikit-learn's. I am not sure about the reason for that. Probably because evaluate() and fit() use batches? But how could I overcome this problem? For evaluation it is not so much of a problem, since I can just use predict. But how can I show a valid F1 training score?

Example F1-Score definition for my 7-class problem

tfa.metrics.F1Score(num_classes=7, average='macro', threshold=0.5)
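
For reference, a metric like this is attached at compile time so that fit() and evaluate() report it. A minimal sketch of such a setup (the model architecture, input shape, optimizer, and loss below are placeholders, not taken from the original post):

import tensorflow as tf
import tensorflow_addons as tfa

# Placeholder model: the real architecture is not shown in the post.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(7, activation='sigmoid'),
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[tfa.metrics.F1Score(num_classes=7, average='macro', threshold=0.5)],
)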

Training

model.fit(ds.train_ds,validation_data=ds.val_ds,epochs=EPOCHS)
F1: 0.4163

Evaluation results

model.evaluate(ds.test_ds)
F1: 0.44059306383132935

Prediction

pred = model.predict(ds.test_ds)
metric = tfa.metrics.F1Score(num_classes=7, average='macro', threshold=0.5)
metric.update_state(y_true, y_pred)
result = metric.result()
result.numpy()
F1: 0.1444352

Scikit-Evaluation

from sklearn.metrics import f1_score
print(f1_score(y_true, y_pred, average='macro'))
F1: 0.1444351874222774
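
Neither snippet above shows how y_true and y_pred were built. A minimal sketch of one way to assemble them, assuming ds.test_ds yields (features, one-hot label) batches in a stable order and applying the 0.5 threshold by hand so the same arrays work for both tfa and scikit-learn:

import numpy as np

# Gather the one-hot ground-truth labels from the test dataset; this only
# lines up with model.predict() if the dataset does not reshuffle between
# iterations.
y_true = np.concatenate([y.numpy() for _, y in ds.test_ds], axis=0)

# Threshold the predicted probabilities at 0.5 to get a binary indicator
# matrix, which both tfa.metrics.F1Score and sklearn.metrics.f1_score accept.
y_pred = (pred >= 0.5).astype('float32')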

The problem was that the test dataset was shuffled after each full iteration. Disabling this led to consistent scores across all evaluation methods.
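
One way to see the effect (a sketch reusing the post's ds.test_ds, assuming it yields (features, labels) batches): with tf.data's default reshuffle_each_iteration=True, two passes over the dataset return the labels in different orders, so labels collected in one pass cannot be paired with predictions produced in another pass.

import numpy as np

# Two separate passes over the (shuffled) test dataset.
labels_first_pass = np.concatenate([y.numpy() for _, y in ds.test_ds], axis=0)
labels_second_pass = np.concatenate([y.numpy() for _, y in ds.test_ds], axis=0)

# Prints False while shuffling is enabled: the order changed between
# iterations, silently misaligning y_true and the model.predict() output.
print(np.array_equal(labels_first_pass, labels_second_pass))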

I simply added an additional parameter to my dataset tuning function:

def __configureperformance__(self,ds,shuffle=True):
    ds = ds.cache()
    if shuffle:
        ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(self.batch_size)
    ds = ds.prefetch(buffer_size=self.AUTOTUNE)
    return ds
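
A call site could then pass shuffle=False for the validation and test data; the dataset variable names below are assumptions, only __configureperformance__ comes from the snippet above:

# Shuffle only the training split; keep validation/test order fixed so that
# labels gathered from the dataset stay aligned with model.predict().
self.train_ds = self.__configureperformance__(train_ds, shuffle=True)
self.val_ds = self.__configureperformance__(val_ds, shuffle=False)
self.test_ds = self.__configureperformance__(test_ds, shuffle=False)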
