
Adding additional loss with constant zero output changes model convergence

I have set up a Returnn Transformer model for NMT which I want to train with an additional loss for every encoder/decoder attention head h in every decoder layer l (in addition to the vanilla cross-entropy loss), i.e.:

loss = CrossEntropyLoss + sum_{Layer l=1,...,6} sum_{Head h=1,...,8} (lambda * AttentionLoss(l, h))

for some scalar lambda. I implemented the attention loss itself as an eval layer with the loss=as_is option, which returns a single number per batch (the value of lambda * AttentionLoss(l, h)).

As a test, I also implemented a version with one loss per layer l, equivalent to lambda * sum_{Head h=1,...,8} AttentionLoss(l, h), to reduce the number of losses, since I had noticed a decrease in performance and the log files were getting very large (Returnn prints every loss for each batch).
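To make the difference between the two setups explicit, here is a plain-Python sketch (lam and att_loss(l, h) are just placeholders for the scalar lambda and the per-head loss value, not functions from my actual config):

# Variant A: one Returnn loss per layer AND head -> 6 * 8 = 48 separate losses.
losses_a = [lam * att_loss(l, h) for l in range(1, 7) for h in range(1, 9)]
# Variant B: one Returnn loss per layer, heads summed inside the layer -> 6 losses.
losses_b = [lam * sum(att_loss(l, h) for h in range(1, 9)) for l in range(1, 7)]
# Mathematically sum(losses_a) == sum(losses_b); only the grouping into Returnn losses differs.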

However, I got very different results from the two implementations: a model trained with one loss per layer AND head performs consistently better. I confirmed this with multiple training runs.

To investigate this, I ran a training where I set lambda=0.0, i.e. effectively disabled the attention loss. Even here, in comparison to the baseline without any additional losses, a model trained with these additional 6 losses (all outputting a constant 0) performs noticeably worse, see this table:

+--------------------------------------------+-------------+-------------+
|                                            |   Dev Set   |   Test Set  |
+--------------------------------------------+------+------+------+------+
|                                            | BLEU |  TER | BLEU |  TER |
+--------------------------------------------+------+------+------+------+
| Only Cross Entropy Loss                    | 35.7 | 51.4 | 34.2 | 53.5 |
+--------------------------------------------+------+------+------+------+
| + One loss per layer and head (lambda 0)   | 35.5 | 51.5 | 33.9 | 53.7 |
+--------------------------------------------+------+------+------+------+
| + One loss per layer (lambda 0)            | 35.4 | 51.8 | 33.5 | 54.2 |
+--------------------------------------------+------+------+------+------+
| + Simplified One loss per layer (lambda 0) | 35.1 | 52.0 | 33.5 | 54.3 |
+--------------------------------------------+------+------+------+------+

Here, the "simplified" version is implemented exactly like this:

'dec_01_weight_loss': {
   'class': 'eval', 'eval': '0.0 * tf.reduce_sum(source(0, auto_convert=False))',
   'from': ['dec_01_att_weights'], 'loss': 'as_is',
   'out_type': {   'batch_dim_axis': None, 'dim': None, 'dtype': 'float32', 'feature_dim_axis': None,
                   'shape': (), 'time_dim_axis': None}}

While the actual loss I use is a bit more complicated, I uploaded my full config files here. (There the loss layer is called dec_01_att_weight_variance etc.)

And all lambda=0.0 implementations mentioned above output the value 0.0 for all additional losses in every training step:

train epoch 1, step 0, cost:output/dec_01_weight_loss 0.0, cost:output/dec_02_weight_loss 0.0, cost:output/dec_03_weight_loss 0.0, [....], cost:output/output_prob 8.541749455164052, error:decision 0.0, error:output/output_prob 0.9999999680730979, loss 8.541749, max_mem_usage:GPU:0 1.2GB, mem_usage:GPU:0 1.2GB, 3.999 sec/step, elapsed 0:00:38, exp. remaining 1:30:00, complete 0.71%

What is going on here? Is there any explanation for why the models behave differently, i.e. why an additional loss with a constant value of 0.0 changes the model behavior?

I am using TF 1.15.0 (v1.15.0-0-g590d6eef7e) and Returnn 20200613.152716--git-23332ca, with Python 3.8.0 and CUDA 10.1.


Follow-up update: I tested the same config using pre-training, where I disable my loss completely for the first n-1 checkpoints (here e.g. n=50), using the following code:

def custom_construction_algo(idx, net_dict):
    if idx == 0:
        # Remove the additional loss layers from all 6 decoder layers for the pre-training phase.
        for lay in range(1, 7):
            del net_dict["output"]["unit"]["dec_%02i_att_loss" % lay]
        return net_dict
    else:
        # Returning None ends pre-training, i.e. the full network including the losses is used afterwards.
        return None

# Repeat construction step 0 for 49 epochs, so the additional losses only become active from checkpoint n=50 on.
pretrain = {"repetitions": 49, "construction_algo": custom_construction_algo}

In the log file, for the first n-1 checkpoints I (correctly) only see the CE loss being reported.

Here I am showing my Dev BLEU at the last checkpoint trained without the additional loss (i.e. n-1, here 49); each experiment was run multiple times:

  • Baseline (no additional loss): 31.8, 31.7, 31.7 BLEU
  • One loss per layer disabled with pretraining: 29.2, 29.0, 28.5 BLEU
  • One loss per layer with lambda=0.0 (as in original question): 28.8, 28.7 BLEU
  • One loss per layer AND head with lambda=0.0 (as in original question): 31.8 BLEU

From my understanding, the TF graphs for the pre-training config and the baseline should be identical up to checkpoint n=50. Yet they perform very differently. What is going on?

The full config I used for this kind of pre-training can be found here. The heads of the corresponding log files can be found here. I am using NewbobMultiEpoch with Adam:

learning rate control: NewbobMultiEpoch(num_epochs=9, update_interval=1, relative_error_threshold=0, learning_rate_decay_factor=0.7, learning_rate_growth_factor=1.0), epoch data: , error key: None
Create optimizer <class 'tensorflow.python.training.adam.AdamOptimizer'> with options {'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-08, 'learning_rate': <tf.Variable 'learning_rate:0' shape=() dtype=float32_ref>}.
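In config terms, this roughly corresponds to the following settings (a sketch: the option names are the usual Returnn config options as far as I can tell, and the values are taken from the log above):

# Sketch only; option names assumed from common Returnn configs, values from the log above.
adam = True
learning_rate = 1e-4
learning_rate_control = "newbob_multi_epoch"
newbob_multi_num_epochs = 9
newbob_multi_update_interval = 1
newbob_learning_rate_decay = 0.7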

For all reported experiments, the learning rate does not decrease until after checkpoint 100, staying constant at its initial value of 10^-4.

EDIT: I made a mistake and accidentally used different Returnn versions across my experiments. The Returnn version I used for my experiments with additional losses seems to have contained some local changes I made. When rerunning a baseline with the new version, it performed significantly worse, very similar to the other BLEU values documented here. A subtle bug in one of my Returnn versions; that's all there was to this issue.

You are aware that the training is non-deterministic anyway, right? Did you try to rerun each case a couple of times? Also the baseline? Maybe the baseline itself is an outlier.

Also, changing the computation graph, even if it is effectively a no-op, can have an effect. Unfortunately it can be sensitive.

You might want to try setting deterministic_train = True in your config. This might make it a bit more deterministic. Maybe you get the same result then in each of your cases. This might make it a bit slower, though.

The order of parameter initialization might be different as well. It depends on the order in which the layers are created. Maybe compare that in the log. It is always the same random initializer, but it would then use a different seed offset, so you would get a different initialization. You could play around with explicitly setting random_seed in the config and see how much variance you get from that alone. Maybe all these values are within that range.
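Both deterministic_train and random_seed are just config-level settings, e.g. (the values here are arbitrary examples):

# Sketch: the two config options mentioned above; the values are arbitrary examples.
deterministic_train = True
random_seed = 42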

For more in-depth debugging, you could directly compare the computation graphs (in TensorBoard). Maybe there is a difference which you did not notice. Also, maybe make a diff of the log output during network construction for the pretrain vs. baseline case. There should be no diff.
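If you want to dump a graph for TensorBoard from plain TF code, a minimal TF1 sketch looks like this (the path is just a placeholder):

# Minimal sketch: write out the current default graph so TensorBoard can display it
# (then run: tensorboard --logdir /tmp/tf_graph_dump).
import tensorflow as tf

writer = tf.compat.v1.summary.FileWriter("/tmp/tf_graph_dump",
                                         graph=tf.compat.v1.get_default_graph())
writer.flush()
writer.close()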

(As this is maybe a mistake, for now only as a side comment: Of course, different RETURNN versions might have some different behavior. So this should be the same.)

Another note: You do not need this tf.reduce_sum in your loss. Actually that might not be such a good idea, because it then forgets about the number of frames and the number of sequences. If you just do not use tf.reduce_sum, it should also work, and then you get the correct normalization.

Another note: Instead of your lambda, you can also use loss_scale, which is simpler, and you get the original value in the log.

So basically, you could write it this way:

'dec_01_weight_loss': {
   'class': 'copy', 'from': 'dec_01_att_weights',
   'loss': 'as_is', 'loss_scale': ...}

This should be (mostly) equivalent. Actually it should be more correct, as it will not take the masked frames into account (those behind the sequence end).

Note that using pretrain will (by default) keep the learning rate fixed. This might be a difference in your experiments. (But simply check your log / learning-rate data file for this.) By the way, if this is the case, it looks like the fixed (probably higher) learning rate seems to perform better, right? So maybe you even want to do that by default?

Also check your log for "reinit because network description differs". This should have no big effect, but who knows. This will also reset the current state of the optimizer (momentum or so; I guess you use Adam?). But even with pretrain, I think you will not have this, as you always keep the network the same.

Actually, speaking of the learning rate: How did you configure the learning rate scheduling? It has a somewhat "clever" logic to determine which score to look at (used for the threshold). If it looks at some of your custom losses, the behavior will be different. Especially if you do not use loss_scale as explained above, this will also play a role. You can configure it explicitly via learning_rate_control_error_measure.
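For example, to pin it to the CE score only, something like this (the exact key name is an assumption here; check your learning-rate file or the log for the keys that actually exist in your setup):

# Sketch: make the LR scheduling look only at the cross-entropy score, not at any custom loss.
# The key name is an assumption, derived from the "output/output_prob" layer name in the log above.
learning_rate_control_error_measure = "dev_score_output/output_prob"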


As a small demonstration of how you can still get a non-zero gradient, even for 0.0 * loss:

import tensorflow as tf
import better_exchook


def main():
  max_seq_len = 15
  seq_len = 10

  logits = tf.zeros([max_seq_len])
  mask = tf.less(tf.range(max_seq_len), seq_len)
  # Mask out the positions behind the sequence end with -inf.
  logits_masked = tf.where(mask, logits, float("-inf"))
  # Cross entropy of the masked softmax with itself (i.e. its entropy), as a stand-in for some loss.
  ce = -tf.reduce_sum(tf.where(mask, tf.nn.softmax(logits_masked) * tf.nn.log_softmax(logits_masked), 0.0))
  # Scaling by 0.0 does not give a zero gradient here: the backward pass through the -inf entries
  # produces nan, and 0.0 * nan is still nan, so it survives in d_logits.
  loss = 0.0 * ce

  d_logits, = tf.gradients(loss, [logits])

  with tf.compat.v1.Session() as session:
    print(session.run((ce, loss, d_logits)))


if __name__ == "__main__":
  better_exchook.install()
  tf.compat.v1.disable_eager_execution()
  main()

This will output: (2.3025851, 0.0, array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 0., 0., 0., 0., 0.], dtype=float32))

This gets nan , but I think you might also be able to construct cases where you get some non-inf/non-nan/non-zero value.

If you want to dump gradients in your eval layer, or in general in TF code, in a very simple way, you can do this:

import tensorflow as tf
from tensorflow.python.framework import ops


# Gradient function that prints the incoming gradient and then passes it through unchanged.
@ops.RegisterGradient("IdentityWithPrint")
def _identity_with_print(op, grad):
  with tf.control_dependencies([tf.print([op.name, "grad:", grad])]):
    return [tf.identity(grad)]


def debug_grad(x):
  """
  :param tf.Tensor x:
  :return: x, but gradient will be printed
  :rtype: tf.Tensor
  """
  g = tf.compat.v1.get_default_graph()
  # The Identity op created here will use the printing gradient registered above.
  with g.gradient_override_map({"Identity": "IdentityWithPrint"}):
    return tf.identity(x, name=x.name.split("/")[-1].replace(":", "_"))

And then, at the beginning of your eval layer, you just write x = debug_grad(source(0, auto_convert=False)), or something like that. Maybe also extend the tf.print(...), e.g. with summarize=-1.
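As a minimal self-contained check of the helper itself (appended to the snippet above; the tensor and the squared loss are arbitrary examples):

# Usage sketch: the gradient flowing into the wrapped tensor is printed by the
# custom gradient when the gradient op is evaluated.
tf.compat.v1.disable_eager_execution()  # no-op under TF1 graph mode

x = tf.constant([1.0, 2.0, 3.0], name="x")
y = tf.reduce_sum(debug_grad(x) ** 2.0)
dx, = tf.gradients(y, [x])

with tf.compat.v1.Session() as session:
  print(session.run(dx))  # prints [2. 4. 6.], and the tf.print shows the same gradient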
