Fluctuating loss during training for text binary classification
I'm fine-tuning a Longformer on a document-level binary text classification task using the Huggingface Trainer class, and I'm monitoring the metrics of some checkpoints with Tensorboard.
Even though the F1 score and accuracy are quite high, I am puzzled by the fluctuations in the training loss.
I read online that a reason for this could be:
Here I report the trends of F1, accuracy, loss, and smoothed loss. The grey line uses a learning rate of 1e-6, while the pink one uses 1e-5.
I summarize all the details of my training:
What could be the reason? Can this be considered a problem despite the fairly good F1 and accuracy results?
I will first explain the reason for the fluctuations and then a possible way to reduce them.
REASON
When you train a network, you compute a gradient that tells you how to change the weights to reduce the loss. To do that, you backpropagate the loss. Ideally, you would compute the loss over all the samples in your data, so that the gradient reflects every sample. In practice, this is usually infeasible because of the compute and memory cost of calculating the gradient on all samples at once.
Therefore, we use a small batch_size as an approximation! Instead of considering all the samples, we compute the gradient on a small subset of them. The trade-off is that we lose some information about the full gradient.
Rule of thumb: smaller batch sizes give noisy gradients, but they often converge faster because you get more updates per epoch. If your batch size is 1, you will have N updates per epoch. If it is N, you will have only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly per epoch and need more memory and computation per update.
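The update counts in the rule of thumb can be checked with a few lines of Python. This is only an illustrative sketch: the dataset size of 1000 and the helper name updates_per_epoch are made up for the example and do not come from the question.

```python
import math


def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of weight updates in one pass over the data (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


n = 1000  # hypothetical dataset size, not taken from the question
for bs in (1, 32, n):
    print(f"batch_size={bs:4d} -> {updates_per_epoch(n, bs)} updates per epoch")
```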
That is why, with smaller batch sizes, you observe a fluctuating loss: the gradient is noisy.
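To make the noise argument concrete, here is a small self-contained sketch. It uses a toy 1-D regression problem with synthetic data (an assumption for illustration, not the Longformer setup from the question) and measures how much the mini-batch gradient varies from draw to draw for different batch sizes.

```python
import random
import statistics

random.seed(0)

# Synthetic 1-D regression data: y = 3x + noise (hypothetical example data).
xs = [random.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]

w = 0.0  # current weight; the per-sample loss is 0.5 * (w*x - y)**2

# Per-sample gradients of the loss with respect to w.
per_sample_grads = [(w * x - y) * x for x, y in zip(xs, ys)]
full_gradient = statistics.fmean(per_sample_grads)


def minibatch_gradient(batch_size: int) -> float:
    """Gradient estimate from one randomly drawn mini-batch."""
    return statistics.fmean(random.sample(per_sample_grads, batch_size))


def gradient_noise(batch_size: int, trials: int = 500) -> float:
    """Standard deviation of the mini-batch gradient over many random draws."""
    estimates = [minibatch_gradient(batch_size) for _ in range(trials)]
    return statistics.stdev(estimates)


noise = {bs: gradient_noise(bs) for bs in (1, 8, 64)}
print(f"full-data gradient: {full_gradient:.3f}")
for bs, std in noise.items():
    print(f"batch_size={bs:2d}  std of gradient estimate={std:.3f}")
```

The spread of the estimate shrinks as the batch grows, which is the same effect that makes the training loss curve smoother with larger batches.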
SOLUTION: Accumulated Gradients
If memory limits prevent you from using a larger batch, you can use accumulated gradients to reduce the fluctuations. The loss and gradients are computed after each mini-batch, but instead of updating the weights on every batch, the gradients are accumulated over consecutive batches. The parameters are then updated based on the accumulated gradient after a specified number of batches.
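The accumulation idea can be sketched without any deep learning library. The example below is a toy illustration under stated assumptions (a 1-D linear model, random data, and made-up sizes of 4 samples per micro-batch and 8 accumulation steps). It checks that accumulating scaled gradients over several equal-sized micro-batches gives the same update as one large batch.

```python
import random

random.seed(0)

# Hypothetical 1-D regression data for a model y ≈ w * x.
data = [(random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)) for _ in range(32)]
w = 0.5


def batch_gradient(batch, weight):
    """Mean gradient of 0.5 * (weight*x - y)**2 with respect to weight over a batch."""
    return sum((weight * x - y) * x for x, y in batch) / len(batch)


micro_batch_size = 4
accumulation_steps = 8  # 8 micro-batches of 4 samples -> effective batch of 32

# Accumulate: scale each micro-batch gradient by 1/accumulation_steps and sum,
# leaving w untouched until every micro-batch has been processed.
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated += batch_gradient(micro_batch, w) / accumulation_steps

large_batch = batch_gradient(data, w)

learning_rate = 0.1
w_accumulated = w - learning_rate * accumulated  # one update after accumulation
w_large_batch = w - learning_rate * large_batch  # one update from the full batch
print(accumulated, large_batch)
```

Only one micro-batch has to fit in memory at a time, yet the update matches the large-batch one up to floating-point rounding. The price is fewer weight updates per epoch.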
This page of the documentation shows how to apply it: https://huggingface.co/transformers/v1.2.0/examples.html
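With the Trainer class used in the question, accumulation is typically controlled through the gradient_accumulation_steps argument of TrainingArguments. The sketch below only builds the keyword arguments and computes the resulting effective batch size. The batch size, number of accumulation steps, output path, and single-device setup are all assumptions, since the question does not state them.

```python
# Hypothetical values: the question does not state the batch size or the hardware.
training_kwargs = {
    "output_dir": "longformer-binary-clf",  # hypothetical output path
    "per_device_train_batch_size": 2,       # what fits in memory (assumed)
    "gradient_accumulation_steps": 16,      # micro-batches per weight update (assumed)
    "learning_rate": 1e-5,                  # the pink run from the question
}
num_devices = 1  # assumption: a single GPU

effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * num_devices
)
print(f"effective batch size: {effective_batch_size}")

# With the transformers library installed, the arguments would be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```

Raising gradient_accumulation_steps increases the effective batch size without increasing memory use per step, so it is a reasonable first knob to try when the loss curve is too noisy.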