Adding additional loss with constant zero output changes model convergence

I have set up a Returnn Transformer model for NMT, which I want to train with an additional loss for every encoder/decoder attention head h on every decoder layer l (in addition to the vanilla cross-entropy loss), i.e.:

loss = CrossEntropyLoss + sum_{Layer l=1,...,6} sum_{Head h=1,...,8} (lambda * AttentionLoss(l, h))

for some scalar lambda. I implemented the attention loss itself as an eval layer, using the loss=as_is option, which returns a single number for each batch (that is, the value of lambda * AttentionLoss(l, h)).
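Schematically, each of these per-head loss layers has the same skeleton as the simplified layer shown further below; the layer name and the eval expression here are only placeholders (the real loss expression is in the linked configs):

'dec_01_head_01_att_loss': {   # hypothetical name, only for illustration
   'class': 'eval', 'eval': 'lambda_scale * att_loss_head_01(source(0, auto_convert=False))',   # placeholder expression
   'from': ['dec_01_att_weights'], 'loss': 'as_is',
   'out_type': {'batch_dim_axis': None, 'dim': None, 'dtype': 'float32', 'feature_dim_axis': None,
                'shape': (), 'time_dim_axis': None}}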

As a test, I also implemented a version with one loss per layer l, equivalent to lambda * sum_{Head h=1,...,8} AttentionLoss(l, h), in order to reduce the number of losses, as I had noticed a decrease in performance and the log files were getting very large (Returnn prints every loss for every batch).

However, I got very different results for the two implementations: a model trained with one loss per layer AND head consistently performs better. I verified this with multiple training runs.

To investigate this, I tried a training run where I set the parameter lambda=0.0, i.e. effectively disabled the attention loss. Even here, compared to the baseline without any additional losses, a model trained with these 6 additional losses, all outputting a constant 0, performs noticeably worse; see this table:

+--------------------------------------------+-------------+-------------+
|                                            |   Dev Set   |   Test Set  |
+--------------------------------------------+------+------+------+------+
|                                            | BLEU |  TER | BLEU |  TER |
+--------------------------------------------+------+------+------+------+
| Only Cross Entropy Loss                    | 35.7 | 51.4 | 34.2 | 53.5 |
+--------------------------------------------+------+------+------+------+
| + One loss per layer and head (lambda 0)   | 35.5 | 51.5 | 33.9 | 53.7 |
+--------------------------------------------+------+------+------+------+
| + One loss per layer (lambda 0)            | 35.4 | 51.8 | 33.5 | 54.2 |
+--------------------------------------------+------+------+------+------+
| + Simplified One loss per layer (lambda 0) | 35.1 | 52.0 | 33.5 | 54.3 |
+--------------------------------------------+------+------+------+------+

Here, the "simplified" version is implemented exactly like this:

'dec_01_weight_loss': {
   'class': 'eval', 'eval': '0.0 * tf.reduce_sum(source(0, auto_convert=False))',
   'from': ['dec_01_att_weights'], 'loss': 'as_is',
   'out_type': {   'batch_dim_axis': None, 'dim': None, 'dtype': 'float32', 'feature_dim_axis': None,
                   'shape': (), 'time_dim_axis': None}}

While the actual loss I use is a bit more complicated, I have uploaded my full config files here. (There, the loss layer is called dec_01_att_weight_variance etc.)

And all lambda=0.0 implementations mentioned above output the value 0.0 for all additional losses in every training step:

train epoch 1, step 0, cost:output/dec_01_weight_loss 0.0, cost:output/dec_02_weight_loss 0.0, cost:output/dec_03_weight_loss 0.0, [....], cost:output/output_prob 8.541749455164052, error:decision 0.0, error:output/output_prob 0.9999999680730979, loss 8.541749, max_mem_usage:GPU:0 1.2GB, mem_usage:GPU:0 1.2GB, 3.999 sec/step, elapsed 0:00:38, exp. remaining 1:30:00, complete 0.71%

What is going on here? Is there any explanation for why the models behave differently? Why does an additional loss with a constant value of 0.0 change the model behavior?

I am using TF 1.15.0 (v1.15.0-0-g590d6eef7e) and Returnn 20200613.152716--git-23332ca, with Python 3.8.0 and CUDA 10.1.


Follow-up update: I tested the same config using pre-training, where I disable my loss completely for the first n-1 checkpoints (here e.g. n=50) using the following code:

def custom_construction_algo(idx, net_dict):
    if idx == 0:
        # Remove all additional attention-loss layers from this copy of the network.
        for lay in range(1, 7):
            del net_dict["output"]["unit"]["dec_%02i_att_loss" % lay]
        return net_dict
    else:
        # Returning None tells Returnn that pre-training construction is finished.
        return None

# The idx == 0 step is repeated for 49 epochs, so the losses only appear from checkpoint n=50 on.
pretrain = {"repetitions": 49, "construction_algo": custom_construction_algo}

In the log file, for the first n-1 checkpoints, I (correctly) only see the CE loss being reported.

Here I am showing my Dev BLEU at the last checkpoint trained without the additional loss (i.e. n-1, here 49), with each experiment run multiple times:

  • Baseline (no additional loss): 31.8, 31.7, 31.7 BLEU
  • One loss per layer, disabled with pre-training: 29.2, 29.0, 28.5 BLEU
  • One loss per layer with lambda=0.0 (as in the original question): 28.8, 28.7 BLEU
  • One loss per layer AND head with lambda=0.0 (as in the original question): 31.8 BLEU

From my understanding, the TF graphs for the pre-training config and the baseline should be identical up to checkpoint n=50. Yet they perform very differently. What is going on?

The full config I used for this kind of pre-training can be found here. The heads of the corresponding log files can be found here. I am using NewbobMultiEpoch with Adam:

learning rate control: NewbobMultiEpoch(num_epochs=9, update_interval=1, relative_error_threshold=0, learning_rate_decay_factor=0.7, learning_rate_growth_factor=1.0), epoch data: , error key: None
Create optimizer <class 'tensorflow.python.training.adam.AdamOptimizer'> with options {'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-08, 'learning_rate': <tf.Variable 'learning_rate:0' shape=() dtype=float32_ref>}.

For all reported experiments, the learning rate does not decrease until checkpoints beyond 100, staying constant at the initial 10^-4.

EDIT: I made a mistake and accidentally used a different Returnn version across my experiments. The Returnn version I used for my experiments with additional losses seems to have contained some local changes I had made. When rerunning the baseline with the new version, it performed significantly worse - very similar to the other BLEU values documented here. A subtle bug in one of my Returnn versions - that is all there was to this issue.

You are aware that the training is non-deterministic anyway, right? Did you try to rerun each case a couple of times? Also the baseline? Maybe the baseline itself is an outlier.

Also, changing the computation graph, even if this will be a no-op, can have an effect. Unfortunately it can be sensitive.

You might want to try setting deterministic_train = True in your config. This might make it a bit more deterministic. Maybe you will then get the same result in each of your cases. This might make it a bit slower, though.

The order of parameter initialization might be different as well. It depends on the order in which the layers are created; maybe compare that in the log. It is always the same random initializer, but it would then use a different seed offset, so you would get a different initialization. You could also experiment by explicitly setting random_seed in the config and see how much variance you get from that. Maybe all these values are within that range.
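Both of these are plain top-level options in the config (a small sketch; the values here are only examples, not taken from the configs above):

deterministic_train = True   # avoid non-deterministic ops where possible; training may get a bit slower
random_seed = 1              # vary this across runs to see how much seed-related variance you get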

For more in-depth debugging, you could directly compare the computation graphs (in TensorBoard). Maybe there is a difference that you did not notice. Also, maybe diff the log output during net construction for the pretrain vs. baseline case. There should be no diff.

(As this is maybe a mistake, for now only as a side comment: of course, different RETURNN versions might behave somewhat differently, so the version should be the same across experiments.)

Another note: you do not need this tf.reduce_sum in your loss. Actually, that might not be such a good idea: it forgets about the number of frames and the number of seqs. If you simply do not use tf.reduce_sum, it should also work, but then you get the correct normalization.

Another note: Instead of your lambda, you can also use loss_scale, which is simpler, and you get the original value in the log.

So basically, you could write it this way:

'dec_01_weight_loss': {
   'class': 'copy', 'from': 'dec_01_att_weights',
   'loss': 'as_is', 'loss_scale': ...}

This should be (mostly) equivalent. Actually it should be more correct, as it will not take the masked frames into account (those beyond the sequence end).

Note that using pretrain (by default) will keep the learning rate fixed. This might be a difference in your experiments. (But simply check your log / learning rate data file for this.) Btw, if this is the case, it looks like the fixed (probably higher) learning rate seems to perform better, right? So maybe you even want to do that by default?

Also check your log for "reinit because network description differs". This should have no big effect, but who knows. This will also reset the current state of the optimizer (momentum or so; I guess you use Adam?). But even with pretrain, I think you will not have this, as you always keep the network the same.

Actually, speaking of the learning rate: how did you configure the learning rate scheduling? It has a somewhat "clever" logic to determine which score to look at (used for the threshold). If it looks at some of your custom losses, the behavior will be different. Especially if you do not use loss_scale as I explained, this will also play a role. You can configure it explicitly via learning_rate_control_error_measure.
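For example, something like this in the config pins the score the scheduler looks at (the key name here is only an assumption; check your learning-rate data file for the keys that actually exist in your setup):

learning_rate_control_error_measure = "dev_score_output/output_prob"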


As a small demonstration of how you can still get a non-zero gradient, even for 0.0 * loss:

import tensorflow as tf
import better_exchook


def main():
  max_seq_len = 15
  seq_len = 10

  logits = tf.zeros([max_seq_len])
  mask = tf.less(tf.range(max_seq_len), seq_len)
  # Positions beyond seq_len get logit -inf; the forward value of ce below is finite,
  # but the -inf entries make some intermediate gradients inf/nan.
  logits_masked = tf.where(mask, logits, float("-inf"))
  ce = -tf.reduce_sum(tf.where(mask, tf.nn.softmax(logits_masked) * tf.nn.log_softmax(logits_masked), 0.0))
  # Scaling by 0.0 does not clear those, since 0.0 * inf = nan.
  loss = 0.0 * ce

  d_logits, = tf.gradients(loss, [logits])

  with tf.compat.v1.Session() as session:
    print(session.run((ce, loss, d_logits)))


if __name__ == "__main__":
  better_exchook.install()
  tf.compat.v1.disable_eager_execution()
  main()

This will output: (2.3025851, 0.0, array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 0., 0., 0., 0., 0.], dtype=float32))

This gives nan here, but I think you might also be able to construct cases where you get some non-inf/non-nan/non-zero value.

If you want to dump gradients in your eval layer, or in general in TF code, in a very simple way, you can do this:

from tensorflow.python.framework import ops


@ops.RegisterGradient("IdentityWithPrint")
def _identity_with_print(op, grad):
  with tf.control_dependencies([tf.print([op.name, "grad:", grad])]):
    return [tf.identity(grad)]


def debug_grad(x):
  """
  :param tf.Tensor x:
  :return: x, but gradient will be printed
  :rtype: tf.Tensor
  """
  g = tf.compat.v1.get_default_graph()
  with g.gradient_override_map({"Identity": "IdentityWithPrint"}):
    return tf.identity(x, name=x.name.split("/")[-1].replace(":", "_"))

And then you just write (at the beginning of your eval layer): x = debug_grad(source(0, auto_convert=False)), or something like that. Maybe extend the tf.print(...), e.g. with summarize=-1.
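A minimal sketch of that last suggestion, if you want the full tensors printed (the op name IdentityWithPrintFull is made up for this example; register it and map to it in gradient_override_map instead of IdentityWithPrint):

@ops.RegisterGradient("IdentityWithPrintFull")
def _identity_with_print_full(op, grad):
  # Same hook as above, but summarize=-1 makes tf.print show all elements instead of a truncated view.
  with tf.control_dependencies([tf.print([op.name, "grad:", grad], summarize=-1)]):
    return [tf.identity(grad)]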
