
Approximately periodic jumps in TensorFlow model loss

I am using tensorflow.keras to train a CNN on an image recognition problem, using the Adam minimiser to minimise a custom loss (some code is at the bottom of the question). I am experimenting with how much data I need in my training set, and thought I should check whether each of my models has properly converged. However, when plotting loss against the number of training epochs for different training set fractions, I noticed approximately periodic spikes in the loss, as in the plot below. Here, the different lines show different training set sizes as a fraction of my total dataset.

As I decrease the size of the training set (blue -> orange -> green), the frequency of these spikes appears to decrease, though their amplitude appears to increase. Intuitively, I would associate this kind of behaviour with the minimiser jumping out of a local minimum, but I am not experienced enough with TensorFlow/CNNs to know whether that is the correct interpretation. Equally, I can't quite understand the variation with training set size.

Can anyone help me to understand this behaviour? And should I be concerned by these features?

[Plot: training loss vs epoch for three training set fractions, showing approximately periodic spikes]

from quasarnet.models import QuasarNET, custom_loss
from quasarnet import io  # assumed import path for the io.objective helper used below
from tensorflow.keras.optimizers import Adam

...

model = QuasarNET(
        X[0,:,None].shape, 
        nlines=len(args.lines)+len(args.lines_bal)
        )

loss = []
for i in args.lines:
    loss.append(custom_loss)

for i in args.lines_bal:
    loss.append(custom_loss)

adam = Adam(decay=0.)
model.compile(optimizer=adam, loss=loss, metrics=[])

box, sample_weight = io.objective(z, Y, bal, lines=args.lines,
        lines_bal=args.lines_bal)

print("starting fit")
history = model.fit(X[:,:,None], box,
        epochs=args.epochs,
        batch_size=256,
        sample_weight=sample_weight)

Following some discussion with a colleague, I believe that we have solved this problem. By default, the Adam minimiser uses an adaptive learning rate that is inversely proportional to the square root of a running average of the squared gradient over its recent history (roughly, the gradient's recent variance). When the loss starts to flatten out, this estimate decreases, and so the minimiser effectively increases the learning rate. This can happen quite drastically, causing the minimiser to "jump" to a higher-loss point in parameter space.
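For intuition, here is a minimal sketch of the standard Adam update rule in NumPy (illustrative only, not QuasarNET code). The effective step size scales as lr / sqrt(v_hat), so when the gradients flatten out and v_hat decays, the step grows:

import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias-corrected estimates.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Effective step ~ lr / sqrt(v_hat): as the loss flattens and v_hat
    # shrinks, the step grows, which can kick the parameters uphill.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v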

You can avoid this by setting amsgrad=True when initialising the minimiser ( http://www.satyenkale.com/papers/amsgrad.pdf ). AMSGrad keeps a running maximum of the second-moment estimate, which prevents the effective learning rate from increasing in this way, and thus results in better convergence.
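Concretely, the only change needed in the setup from the question is the optimiser construction; amsgrad is a standard keyword argument of tf.keras.optimizers.Adam:

from tensorflow.keras.optimizers import Adam

# As in the question, but with the AMSGrad variant enabled.
adam = Adam(decay=0., amsgrad=True)
model.compile(optimizer=adam, loss=loss, metrics=[])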

The (somewhat basic) plot below shows loss vs number of training epochs for the normal setup from the original question ( norm loss ) compared to the loss when setting amsgrad=True in the minimiser ( amsgrad loss ). Clearly, the loss function is much better behaved with amsgrad=True, and, with more epochs of training, should converge stably.

[Plot: norm loss vs amsgrad loss over training epochs]
