
How to scale a gradient norm in Keras

In the pseudocode for MuZero, they do the following:

hidden_state = tf.scale_gradient(hidden_state, 0.5)

From this question about what that line means, I learned that it is likely a form of gradient norm scaling.

How can I do gradient norm scaling (clipping the gradient norm to a particular length) on a hidden state in Keras? Later on, they also apply the same scaling to a loss value:

loss += tf.scale_gradient(l, gradient_scale)

This site says that I should use the clipnorm parameter on the optimizer, but I don't think that will work, because I'm scaling the gradients before they ever reach the optimizer (and especially because I'm scaling different things to different lengths).
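For reference, clipnorm is passed to the optimizer's constructor and is applied to every weight gradient inside the optimizer step, roughly like this (a minimal sketch; the SGD choice and the values are just placeholders):

from keras import optimizers

# clipnorm rescales each weight gradient to an L2 norm of at most 1.0 inside the
# optimizer, so it cannot be applied selectively to a tensor like hidden_state.
opt = optimizers.SGD(learning_rate=0.01, clipnorm=1.0)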

Here is the particular code in question from the paper, in case it is helpful. (Note that scale_gradient is not an actual TensorFlow function. See the previously linked question if you are confused, as I was.)

def update_weights(optimizer: tf.train.Optimizer, network: Network, batch,
                   weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step, from the real observation.
    value, reward, policy_logits, hidden_state = network.initial_inference(
        image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps, from action and previous hidden state.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

      # THIS LINE HERE
      hidden_state = tf.scale_gradient(hidden_state, 0.5)

    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target

      l = (
          scalar_loss(value, target_value) +
          scalar_loss(reward, target_reward) +
          tf.nn.softmax_cross_entropy_with_logits(
              logits=policy_logits, labels=target_policy))

      # AND AGAIN HERE
      loss += tf.scale_gradient(l, gradient_scale)

  for weights in network.get_weights():
    loss += weight_decay * tf.nn.l2_loss(weights)

  optimizer.minimize(loss)

(Note that this question is different from this one, which asks about multiplying the gradient by a value, not clipping the gradient to a particular magnitude.)

You can use the MaxNorm constraint presented here.

It's very simple and straightforward. Import it with from keras.constraints import MaxNorm

If you want to apply it to a layer's weights, pass kernel_constraint = MaxNorm(max_value=2, axis=0) when you define the layer (read the linked page for details on axis).

You can also use bias_constraint = ... in the same way, as in the sketch below.
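For example, a minimal sketch of a layer definition (the layer size and the max_value of 2 are placeholders, assuming the standalone keras package as imported above):

from keras.layers import Dense
from keras.constraints import MaxNorm

# Constrain each kernel column (axis=0) and the bias vector to a norm of at most 2.
layer = Dense(
    64,
    kernel_constraint=MaxNorm(max_value=2, axis=0),
    bias_constraint=MaxNorm(max_value=2),
)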

If you want to apply it to any other tensor, you can simply call it with a tensor:

normalizer = MaxNorm(max_value=2, axis=0)
normalized_tensor = normalizer(original_tensor)

And you can see the source code is pretty simple:

def __call__(self, w):
    # Compute the norm of w along the configured axis.
    norms = K.sqrt(K.sum(K.square(w), axis=self.axis, keepdims=True))
    # Cap the norms at max_value.
    desired = K.clip(norms, 0, self.max_value)
    # Rescale w so that its norm never exceeds max_value.
    return w * (desired / (K.epsilon() + norms))
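For your hidden_state in particular, you could do something like this (a minimal sketch; the tensor shape, the 0.5 threshold, and the axis choice are assumptions, not values from the paper):

import tensorflow as tf
from keras.constraints import MaxNorm

# Clip each hidden-state vector (one per batch row, hence axis=1) to a norm of at most 0.5.
normalizer = MaxNorm(max_value=0.5, axis=1)

hidden_state = tf.random.normal([8, 32])  # stand-in for the network's hidden state
hidden_state = normalizer(hidden_state)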
