
actor critic policy loss going to zero (with no improvement)

I created an actor critic model to test some OpenAI gym environments. However, I'm having problems in some environments.

CartPole: The model eventually converges and attains the maximum reward. However, for some reason it converges faster if I only use a policy gradient method and not the value function/advantage.

MountainCar, Acrobot: These two environments have negative rewards. If it takes your agent 10 seconds to solve the task, your reward will be -10. For some reason, when I try to solve an environment with negative rewards, my policy loss starts with negative values and slowly converges to 0. The value loss starts absurdly high and then decreases, although it plateaus at some point (when the policy collapses). Can anyone help me diagnose the problem? I added a few logging statements with the relevant episodic values.

from scipy.signal import lfilter
import numpy as np
import gym
import tensorflow as tf

layers = tf.keras.layers

tf.enable_eager_execution()


def discount(x, gamma):
    return lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]


def boltzmann(probs):
    return tf.multinomial(tf.log(probs), 1)


def greedy(probs):
    return tf.argmax(probs)


def gae(bval, vals, rews):
    vboot = np.hstack((vals, bval))
    return rews * vboot[1:] - vals


class PG(tf.keras.Model):

    def __init__(self, n_actions, selection_strategy=boltzmann, lr=0.001):
        super(PG, self).__init__()
        self.fc1 = layers.Dense(64, activation='relu', kernel_initializer=tf.initializers.orthogonal(1))
        self.fc2 = layers.Dense(64, activation='relu', kernel_initializer=tf.initializers.orthogonal(1))
        self.pol = layers.Dense(n_actions, kernel_initializer=tf.initializers.orthogonal(0.01))
        self.val = layers.Dense(1, kernel_initializer=tf.initializers.orthogonal(1))
        self.optimizer = tf.train.AdamOptimizer(learning_rate=lr)
        self.selection_strategy = selection_strategy


    def call(self, input):
        x = tf.constant(input, dtype=tf.float32)
        x = self.fc1(x)
        x = self.fc2(x)
        return self.pol(x), self.val(x)


    def select_action(self, logits):
        probs = tf.nn.softmax(logits)
        a = self.selection_strategy(probs)
        return tf.squeeze(a, axis=[0, 1]).numpy()


def sample(env, model):
    obs, act, rews, vals = [], [], [], []
    ob = env.reset()
    done = False

    while not done:
        # env.render()
        logits, value = model([ob])
        a = model.select_action(logits)
        value = tf.squeeze(value, axis=[0, 1])

        next_ob, r, done, _ = env.step(a)
        obs.append(ob)
        act.append(a)
        rews.append(r)
        vals.append(value.numpy())

        ob = next_ob

    return np.array(obs), np.array(act), np.array(rews), np.array(vals)


# Hyperparameters
GAMMA = 0.99
SAMPLES = 10000000
MAX_GRAD_NORM = 20
UPDATE_INTERVAL = 20


env = gym.make('MountainCar-v0')
model = PG(env.action_space.n)


for t in range(1, SAMPLES + 1):
    obs, act, rews, vals = sample(env, model)
    d_rew = discount(rews, GAMMA)
    d_rew = (d_rew - np.mean(d_rew)) / np.std(d_rew)

    advs = d_rew - vals


    with tf.GradientTape() as tape:

        logits, values = model(obs)
        values = tf.squeeze(values)
        one_hot = tf.one_hot(act, env.action_space.n, dtype=tf.float32)
        xentropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)
        policy_loss = tf.reduce_mean(xentropy * advs)

        diff = d_rew - values

        value_loss = tf.reduce_mean(tf.square(diff))

        policy = tf.nn.softmax(logits)
        entropy = tf.reduce_mean(policy * tf.log(policy + 1e-20))

        total_loss = policy_loss + 0.5 * value_loss - 0.01 * entropy


    grads = tape.gradient(total_loss, model.trainable_weights)
    grads, gl_norm = tf.clip_by_global_norm(grads, MAX_GRAD_NORM)
    model.optimizer.apply_gradients(zip(grads, model.trainable_weights))


    if t % UPDATE_INTERVAL == 0 and t != 0:
        print("ER: {0}, Len: {1}, Pol: {2:.4f}, Val: {3:.4f}, Ent: {4:.4f}, Grad Norm: {5:.4f}"
              .format(np.sum(rews), len(rews), policy_loss, value_loss, entropy, gl_norm))

ER = total reward, Len = Episode length, Pol = Policy Loss, Val = Value Loss, Ent = Entropy, Grad Norm = Gradient Norm

ER: -200.0, Len: 200, Pol: 0.0656, Val: 1.0032, Ent: -0.3661, Grad Norm: 0.0901
ER: -200.0, Len: 200, Pol: -0.0384, Val: 1.0006, Ent: -0.3640, Grad Norm: 0.1186
ER: -200.0, Len: 200, Pol: -0.0585, Val: 1.0034, Ent: -0.3605, Grad Norm: 0.0963
ER: -200.0, Len: 200, Pol: -0.0650, Val: 1.0021, Ent: -0.3595, Grad Norm: 0.1149
ER: -200.0, Len: 200, Pol: 0.0007, Val: 1.0011, Ent: -0.3581, Grad Norm: 0.0893
ER: -200.0, Len: 200, Pol: 0.0024, Val: 1.0007, Ent: -0.3556, Grad Norm: 0.0951
ER: -200.0, Len: 200, Pol: 0.0114, Val: 1.0006, Ent: -0.3529, Grad Norm: 0.0954
ER: -200.0, Len: 200, Pol: 0.0310, Val: 1.0006, Ent: -0.3493, Grad Norm: 0.1060
ER: -200.0, Len: 200, Pol: -0.0187, Val: 0.9997, Ent: -0.3449, Grad Norm: 0.1111
ER: -200.0, Len: 200, Pol: -0.0367, Val: 0.9975, Ent: -0.3348, Grad Norm: 0.1302
ER: -200.0, Len: 200, Pol: -0.0349, Val: 0.9988, Ent: -0.3250, Grad Norm: 0.0884

I'm not sure if I can answer your question completely but I'll provide my 2 cents and hopefully someone else comes and fills out the rest!

The model eventually converges and attains the maximum reward. However, for some reason it converges faster if I only use a policy gradient method and not the value function/ advantage.

This is because CartPole has a very simple action space: the agent only goes left or right. The solution to this problem is very simple, and basic noise added to the system can be enough for the agent to explore its state space. The actor-critic method requires more weights and biases to be tuned, and because there are more parameters to tune, the training time is longer.

For some reason, when I try to solve an environment with negative rewards, my policy starts with negative values and slowly converges to 0.

xentropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)
policy_loss = tf.reduce_mean(xentropy * advs)

As for this part, I believe that the actual loss formulation is

Loss = - log(policy) * Advantage

Note the negative sign, as in https://math.stackexchange.com/questions/2730874/cross-entropy-loss-in-reinforcement-learning. In your formulation, I am not sure whether you included this negative in your loss function. I personally wrote my own loss function when I constructed my policy gradient, but maybe your TensorFlow function takes this into account.
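For what it's worth, tf.nn.softmax_cross_entropy_with_logits_v2 with one-hot labels already returns -log(policy[action]), so the negative sign should be handled for you. Here is a small stand-alone check (assuming TF 1.x eager execution, as in your script; the logits and action are made up purely for illustration):

import tensorflow as tf

tf.enable_eager_execution()

logits = tf.constant([[2.0, 0.5, -1.0]])        # one state, three actions (made-up numbers)
one_hot = tf.one_hot([0], 3, dtype=tf.float32)  # pretend action 0 was taken

# built-in cross entropy vs. the explicit -log(pi(a|s)) formulation
xent = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)
manual = -tf.reduce_sum(one_hot * tf.log(tf.nn.softmax(logits)), axis=1)

print(xent.numpy(), manual.numpy())  # both print the same value

So multiplying xentropy by the advantage, as your code does, already gives -log(pi(a|s)) * A, which is the usual sign.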

As for the value loss, a high loss at the beginning is expected because the network is essentially guessing what the optimal value is.

An additional tip is to use a replay memory for your states, actions, rewards and next states (s2). This way, you decorrelate your trajectories and it allows for more "even" learning. If your states are correlated, the network tends to overfit to your most recent experience.
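A bare-bones replay memory looks something like the sketch below; the class name, capacity and batch size are just illustrative choices, not something from your code:

import random
from collections import deque
import numpy as np

class ReplayMemory:
    def __init__(self, capacity=100000):
        # oldest transitions are dropped automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        # store one transition: state, action, reward, next state, terminal flag
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=64):
        # uniform random sampling breaks the correlation between consecutive steps
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, done = map(np.array, zip(*batch))
        return s, a, r, s2, done

    def __len__(self):
        return len(self.buffer)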

You are also learning online at the moment, which is very unstable for more difficult RL tasks. One way to help with this is the replay memory described above. Another way is to learn in mini-batches, and I believe this is the method David Silver used in his paper. Basically, you want to run many trajectories. After each trajectory, perform backpropagation to compute the gradients of the policy gradient loss (via tape.gradient in eager mode, or tf.gradients in graph mode). Store these gradients, and do the same for the next few trajectories. After a "mini-batch" worth of trajectories, average the gradients across all the runs and then perform gradient descent to update your parameters. The gradient descent step is done exactly as in your code, using the apply_gradients method. You do this because the environment has a lot of noise; by simulating many trajectories, the idea is that the mean update of the mini-batch is a more representative estimate than the one from a single trajectory. I personally use mini-batches of 64.
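As a rough sketch, reusing the names from your script (model, sample, discount, env, GAMMA), the update step could be restructured like this; MINIBATCH = 64 is just the value mentioned above, not something from your code:

MINIBATCH = 64

accumulated = None
for _ in range(MINIBATCH):
    # run one full trajectory and compute its normalized discounted returns
    obs, act, rews, vals = sample(env, model)
    d_rew = discount(rews, GAMMA)
    d_rew = (d_rew - np.mean(d_rew)) / np.std(d_rew)
    advs = d_rew - vals

    with tf.GradientTape() as tape:
        logits, values = model(obs)
        values = tf.squeeze(values)
        one_hot = tf.one_hot(act, env.action_space.n, dtype=tf.float32)
        xentropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)
        # policy loss + value loss (entropy bonus omitted for brevity)
        loss = tf.reduce_mean(xentropy * advs) + 0.5 * tf.reduce_mean(tf.square(d_rew - values))

    # store this trajectory's gradients instead of applying them immediately
    grads = tape.gradient(loss, model.trainable_weights)
    if accumulated is None:
        accumulated = grads
    else:
        accumulated = [a + g for a, g in zip(accumulated, grads)]

# average over the mini-batch and apply a single update
mean_grads = [g / MINIBATCH for g in accumulated]
model.optimizer.apply_gradients(zip(mean_grads, model.trainable_weights))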

To enhance exploration of your state space, I would recommend an Ornstein-Uhlenbeck stochastic process. Basically, this is a stable, temporally correlated noise process. Because the noise is correlated, it lets you walk farther away from your initial state than decorrelated noise (i.e. Gaussian noise) would. If you use decorrelated noise, the long-term average is 0 because it is zero-mean with unit variance, so you essentially end up near where you started. A good explanation can be found here: https://www.quora.com/Why-do-we-use-the-Ornstein-Uhlenbeck-Process-in-the-exploration-of-DDPG and Python code can be found at the very bottom of https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py. Simply add this noise to your action to improve exploration.
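The class below is a compact version of that idea (the linked baselines file contains a similar class); theta, sigma and dt are common defaults rather than values taken from your problem, and this is really aimed at continuous action spaces such as DDPG:

import numpy as np

class OUNoise:
    def __init__(self, mu, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = np.asarray(mu, dtype=np.float64)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        # start the process at its long-term mean
        self.x = np.copy(self.mu)

    def __call__(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape))
        self.x = self.x + dx
        return self.x

Usage would be something like noisy_action = continuous_action + noise(), clipped to the valid action range; continuous_action here is just a placeholder for whatever your actor outputs.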

Summary

The sign of your policy loss function might be incorrect. Also, online learning for difficult problems is very unstable. Two easy-to-implement ways to address this are:

  • Replay memory
  • Mini-batch gradient descent, instead of the stochastic gradient descent currently in your code

To add more stability, you can also use a target network. The idea is that in the early stages your weights get updated very quickly; a frozen target network is kept in the system so that the problem becomes a "non-moving-target" problem. The target network's weights stay fixed while the "real" network is updated after each episode, and after x iterations you copy the real network's weights into the target network. This takes longer to implement, though, so I would suggest trying the two points above first.
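A minimal sketch of the hard-update variant, assuming the PG model from your code and an already-built online network; TARGET_SYNC is an illustrative value:

TARGET_SYNC = 100  # copy the online weights into the target every 100 episodes

target_model = PG(env.action_space.n)
target_model([env.observation_space.sample()])  # one forward pass so the weights exist
target_model.set_weights(model.get_weights())

# ... inside the training loop, compute bootstrap values with target_model
# instead of model, and periodically re-sync the frozen copy:
if t % TARGET_SYNC == 0:
    target_model.set_weights(model.get_weights())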
