
Fluctuating loss during training for text binary classification

Original post: 2020-09-04 14:56:10. Tags: python, machine-learning, pytorch, huggingface-transformers, allennlp

I'm fine-tuning a Longformer on a document-level binary text classification task using the Huggingface Trainer class, and I'm monitoring the metrics of some checkpoints with TensorBoard.

Even though the F1 score and accuracy are quite high, I am puzzled by the fluctuations in the training loss.

I read online that possible reasons for this are:

- A learning rate that is too high. However, I tried 3 values (1e-4, 1e-5 and 1e-6) and all of them had the same effect.
- A batch size that is too small. I'm using a Sagemaker notebook p2.8xlarge, which has 8 K80 GPUs. The largest batch size per GPU that avoids the CUDA out of memory error is 1, so the total batch size is 8. My intuition is that a batch size of 8 is too small for a dataset containing 57K examples (7K steps per epoch). Unfortunately, it's the highest value I can use.
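The step count quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `steps_per_epoch` and its `accumulation_steps` parameter are made up for illustration, and 57K is taken as exactly 57,000.

```python
import math

def steps_per_epoch(num_examples: int, per_device_batch: int, num_devices: int,
                    accumulation_steps: int = 1) -> int:
    """Number of optimizer updates in one pass over the data."""
    effective_batch = per_device_batch * num_devices * accumulation_steps
    return math.ceil(num_examples / effective_batch)

# Figures from the question: about 57K examples, batch size 1 on each of 8 GPUs.
print(steps_per_epoch(57_000, 1, 8))  # 7125
```

This matches the roughly 7K steps per epoch reported in the question.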

Here I have reported the trends of F1, accuracy, loss and smoothed loss. The grey line uses a learning rate of 1e-6, while the pink one uses 1e-5.
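For readers unfamiliar with a "smoothed loss" curve, a simple exponential moving average gives the general idea. This is a minimal sketch under assumptions: the function name `ema_smooth`, the weight of 0.9 and the sample values are all illustrative, and the exact formula behind TensorBoard's smoothing slider may differ.

```python
def ema_smooth(values, weight=0.9):
    """Exponential moving average with bias correction for the early steps."""
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        running = weight * running + (1.0 - weight) * value
        # Dividing by (1 - weight**step) corrects the bias toward the zero start.
        smoothed.append(running / (1.0 - weight ** step))
    return smoothed

# Made-up noisy loss values, not taken from the question's plots.
noisy = [0.9, 0.2, 0.8, 0.1, 0.7, 0.2, 0.6, 0.1]
print(ema_smooth(noisy))
```

The smoothed series moves within a narrower range than the raw one, which is why a trend can be visible in the smoothed curve even when the raw loss jumps around.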

To summarize my training setup:

batch size: 1 x 8 GPUs = 8
learning rate: 1e-4, 1e-5, 1e-6 (all tested, with no improvement in the loss)
model: Longformer
dataset:
- training set: 57K examples
- dev set: 12K examples
- test set: 12K examples

What could be the reason? Should this be considered a problem despite the quite good F1 and accuracy results?

1 Answer

I will first explain the reason for the fluctuations and then a possible way to address them.

REASON

When you train a network, you compute a gradient that tells you how to reduce the loss. To get it, you backpropagate the loss. Ideally, you would compute the loss over all of the samples in your data, because the resulting gradient would then reflect every sample. In practice, this is usually infeasible because of the computational and memory cost of calculating the gradient over all samples at once.

Therefore, we use a small batch_size as an approximation! Instead of considering all the samples, we compute the gradient from a small subset of them. The trade-off is that we lose some information about the true gradient.

Rule of thumb: smaller batch sizes give noisier gradients, but they often converge faster because you perform more updates per epoch. If your batch size is 1, you will have N updates per epoch. If it is N, you will have only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly per epoch and increase the computational cost of each update.

That is why, for smaller batch sizes, you observe a fluctuating loss: the gradient is noisy.
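The effect of batch size on gradient noise can be illustrated with a toy simulation. This is only a sketch under assumptions: the per-sample "gradients" are synthetic random numbers standing in for the gradient of a single weight, and the helper name `minibatch_gradient_std` is made up for this example.

```python
import random
import statistics

def minibatch_gradient_std(per_sample_grads, batch_size, num_batches=2000, seed=0):
    """Standard deviation of the mini-batch mean gradient across random batches."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.sample(per_sample_grads, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.pstdev(estimates)

rng = random.Random(42)
# Hypothetical per-sample gradients of one weight: mean 1.0 with a large spread.
grads = [rng.gauss(1.0, 5.0) for _ in range(10_000)]

for bs in (1, 8, 64, 512):
    print(bs, round(minibatch_gradient_std(grads, bs), 3))
```

The spread of the mini-batch estimate shrinks as the batch grows, roughly in proportion to one over the square root of the batch size, so a batch of 8 gives a much noisier update direction than a batch of 64.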

SOLUTION: Accumulated Gradients

If you are limited by memory, you can use gradient accumulation to combat the fluctuating loss. The loss and gradients are still computed for each mini-batch, but instead of updating the weights after every batch, the gradients are accumulated over consecutive batches. The parameters are then updated from the accumulated gradient after a specified number of batches.
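To see why this works, here is a toy sketch in plain Python rather than actual training code. It assumes a one-parameter model y = w * x with a mean squared error loss and made-up data, and the helper names `mse_grad` and `accumulated_step` are hypothetical. It shows that accumulating scaled gradients over equally sized micro-batches gives the same update as one large batch.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr):
    """One weight update built from several micro-batches."""
    accumulated = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient so the sum is the mean over all samples.
        accumulated += mse_grad(w, micro_batch) / len(micro_batches)
    return w - lr * accumulated  # single update after accumulation

# Made-up (x, y) pairs, split into 4 micro-batches of 2 samples each.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
        (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]

w_accumulated = accumulated_step(0.0, micro_batches, lr=0.01)
w_full_batch = 0.0 - 0.01 * mse_grad(0.0, data)  # one step on the full batch of 8
print(w_accumulated, w_full_batch)
```

In a PyTorch training loop the same idea usually appears as calling `loss.backward()` on every micro-batch (with the loss divided by the number of accumulation steps) and calling `optimizer.step()` followed by `optimizer.zero_grad()` only every k-th batch.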

This page of the documentation shows how to apply it: https://huggingface.co/transformers/v1.2.0/examples.html
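With the Trainer class mentioned in the question, gradient accumulation is normally switched on through `TrainingArguments`. The fragment below is a hedged sketch rather than a tested configuration: the output path and the accumulation value of 8 are assumptions, and the argument names are those used by recent versions of transformers (older releases named the batch size argument differently).

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./longformer-finetuned",  # hypothetical path
    per_device_train_batch_size=1,        # largest value that fit in memory in the question
    gradient_accumulation_steps=8,        # assumed value: 1 x 8 GPUs x 8 = effective batch of 64
    learning_rate=1e-5,                   # one of the rates tried in the question
)
```

With these settings the weights would be updated once every 8 forward/backward passes, so the number of optimizer steps per epoch drops by the same factor.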