
actor critic policy loss going to zero (with no improvement)

I created an actor critic model to test some OpenAI gym environments. However, I'm having problems in some environments.

CartPole: The model eventually converges and attains the maximum reward. However, for some reason it converges faster if I only use a policy gradient method and not the value function/advantage.

MountainCar, Acrobot: These two environments give negative rewards. If it takes your agent 10 seconds to solve the task, your reward will be -10. For some reason, when I try to solve an environment with negative rewards, my policy starts with negative values and slowly converges to 0. The value loss starts absurdly high and then decreases, although it plateaus at some point (when the policy collapses). Can anyone help me diagnose the problem? I added a few logging statements with the relevant episodic values.

from scipy.signal import lfilter
import numpy as np
import gym
import tensorflow as tf

layers = tf.keras.layers

tf.enable_eager_execution()


def discount(x, gamma):
    # Discounted cumulative sum of the rewards, computed with a reversed linear filter.
    return lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]


def boltzmann(probs):
    return tf.multinomial(tf.log(probs), 1)


def greedy(probs):
    return tf.argmax(probs)


def gae(bval, vals, rews):
    vboot = np.hstack((vals, bval))
    return rews * vboot[1:] - vals


class PG(tf.keras.Model):

    def __init__(self, n_actions, selection_strategy=boltzmann, lr=0.001):
        super(PG, self).__init__()
        self.fc1 = layers.Dense(64, activation='relu', kernel_initializer=tf.initializers.orthogonal(1))
        self.fc2 = layers.Dense(64, activation='relu', kernel_initializer=tf.initializers.orthogonal(1))
        self.pol = layers.Dense(n_actions, kernel_initializer=tf.initializers.orthogonal(0.01))
        self.val = layers.Dense(1, kernel_initializer=tf.initializers.orthogonal(1))
        self.optimizer = tf.train.AdamOptimizer(learning_rate=lr)
        self.selection_strategy = selection_strategy


    def call(self, input):
        x = tf.constant(input, dtype=tf.float32)
        x = self.fc1(x)
        x = self.fc2(x)
        return self.pol(x), self.val(x)


    def select_action(self, logits):
        probs = tf.nn.softmax(logits)
        a = self.selection_strategy(probs)
        return tf.squeeze(a, axis=[0, 1]).numpy()


def sample(env, model):
    obs, act, rews, vals = [], [], [], []
    ob = env.reset()
    done = False

    while not done:
        # env.render()
        logits, value = model([ob])
        a = model.select_action(logits)
        value = tf.squeeze(value, axis=[0, 1])

        next_ob, r, done, _ = env.step(a)
        obs.append(ob)
        act.append(a)
        rews.append(r)
        vals.append(value.numpy())

        ob = next_ob

    return np.array(obs), np.array(act), np.array(rews), np.array(vals)


# Hyperparameters
GAMMA = 0.99
SAMPLES = 10000000
MAX_GRAD_NORM = 20
UPDATE_INTERVAL = 20


env = gym.make('MountainCar-v0')
model = PG(env.action_space.n)


for t in range(1, SAMPLES + 1):
    obs, act, rews, vals = sample(env, model)
    d_rew = discount(rews, GAMMA)
    d_rew = (d_rew - np.mean(d_rew)) / np.std(d_rew)

    advs = d_rew - vals


    with tf.GradientTape() as tape:

        logits, values = model(obs)
        values = tf.squeeze(values)
        one_hot = tf.one_hot(act, env.action_space.n, dtype=tf.float32)
        xentropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)
        policy_loss = tf.reduce_mean(xentropy * advs)

        diff = d_rew - values

        value_loss = tf.reduce_mean(tf.square(diff))

        policy = tf.nn.softmax(logits)
        entropy = tf.reduce_mean(policy * tf.log(policy + 1e-20))

        total_loss = policy_loss + 0.5 * value_loss - 0.01 * entropy


    grads = tape.gradient(total_loss, model.trainable_weights)
    grads, gl_norm = tf.clip_by_global_norm(grads, MAX_GRAD_NORM)
    model.optimizer.apply_gradients(zip(grads, model.trainable_weights))


    if t % UPDATE_INTERVAL == 0 and t != 0:
        print("BR: {0}, Len: {1}, Pol: {2:.4f}, Val: {3:.4f}, Ent: {4:.4f}"
              .format(np.sum(rews), len(rews), policy_loss, value_loss, entropy))

ER = total reward, Len = Episode length, Pol = Policy Loss, Val = Value Loss, Ent = Entropy, Grad Norm = Gradient Norm

ER: -200.0, Len: 200, Pol: 0.0656, Val: 1.0032, Ent: -0.3661, Grad Norm: 0.0901
ER: -200.0, Len: 200, Pol: -0.0384, Val: 1.0006, Ent: -0.3640, Grad Norm: 0.1186
ER: -200.0, Len: 200, Pol: -0.0585, Val: 1.0034, Ent: -0.3605, Grad Norm: 0.0963
ER: -200.0, Len: 200, Pol: -0.0650, Val: 1.0021, Ent: -0.3595, Grad Norm: 0.1149
ER: -200.0, Len: 200, Pol: 0.0007, Val: 1.0011, Ent: -0.3581, Grad Norm: 0.0893
ER: -200.0, Len: 200, Pol: 0.0024, Val: 1.0007, Ent: -0.3556, Grad Norm: 0.0951
ER: -200.0, Len: 200, Pol: 0.0114, Val: 1.0006, Ent: -0.3529, Grad Norm: 0.0954
ER: -200.0, Len: 200, Pol: 0.0310, Val: 1.0006, Ent: -0.3493, Grad Norm: 0.1060
ER: -200.0, Len: 200, Pol: -0.0187, Val: 0.9997, Ent: -0.3449, Grad Norm: 0.1111
ER: -200.0, Len: 200, Pol: -0.0367, Val: 0.9975, Ent: -0.3348, Grad Norm: 0.1302
ER: -200.0, Len: 200, Pol: -0.0349, Val: 0.9988, Ent: -0.3250, Grad Norm: 0.0884

I'm not sure if I can answer your question completely, but I'll provide my 2 cents and hopefully someone else comes along and fills in the rest!

The model eventually converges and attains the maximum reward. However, for some reason it converges faster if I only use a policy gradient method and not the value function/advantage.

This is because CartPole has a very simple action space: the agent only goes either left or right. The solution to this problem is very simple, and very basic noise added to the system can be enough for it to explore its state space. The actor critic method requires more weights and biases to be tuned, and because there are more parameters to tune, the training time is longer.

For some reason, when I try to solve an environment with negative rewards, my policy starts with negative values and slowly converges to 0.

xentropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)
policy_loss = tf.reduce_mean(xentropy * advs)

As for this part, I believe that the actual loss formulation is

Loss = - log(policy) * Advantage

where there is a negative sign, as in https://math.stackexchange.com/questions/2730874/cross-entropy-loss-in-reinforcement-learning . In your formulation, I am not sure if you included this negative in your loss function. I personally wrote my own loss function when I constructed my policy gradient, but maybe your TensorFlow function takes this into account.
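As a quick sanity check on the sign, here is a minimal sketch (the toy numbers are made up for illustration only) showing that `softmax_cross_entropy_with_logits_v2` with one-hot labels already returns `-log(policy)` for the taken action, so multiplying it by the advantage gives exactly the `-log(policy) * Advantage` loss above:

import numpy as np
import tensorflow as tf

tf.enable_eager_execution()  # same TF 1.x eager style as the question's code

# Toy batch: 3 states, 2 actions (illustrative values only).
logits = tf.constant([[1.0, -0.5], [0.2, 0.3], [-1.0, 2.0]])
actions = tf.constant([0, 1, 1])
advs = tf.constant([0.5, -1.0, 2.0])

one_hot = tf.one_hot(actions, 2, dtype=tf.float32)

# Formulation 1: the cross-entropy helper used in the question.
xent = tf.nn.softmax_cross_entropy_with_logits_v2(labels=one_hot, logits=logits)

# Formulation 2: explicit -log pi(a|s).
neg_log_pi = -tf.reduce_sum(one_hot * tf.nn.log_softmax(logits), axis=1)

print(np.allclose(xent.numpy(), neg_log_pi.numpy()))  # True: the negative sign is already included

policy_loss = tf.reduce_mean(xent * advs)  # == mean(-log pi(a|s) * advantage)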

As for the value, a high loss at the beginning is expected because it is essentially guessing at what the optimal value is.

Some additional tips and tricks: use a replay memory for your states, actions, rewards and next states (s2). This way you decorrelate your trajectory, which allows for "even" learning. If your states are correlated, the model tends to overfit to your most recent events. A minimal buffer sketch follows below.
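For illustration only, here is one possible minimal replay buffer (the class name, capacity and uniform sampling scheme are my own assumptions, not part of the original code):

import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation of a single trajectory.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(list, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)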

You are also learning online at the moment, which is very unstable for more difficult RL tasks. One way to help with this is the replay memory above. Another way is to learn in mini-batches, and I believe this is the method David Silver used in his paper. Basically, you want to run many trajectories. After each trajectory, perform backpropagation to calculate the loss of the policy gradient via the tf.gradients method in TensorFlow. Store these gradients, and do this again for the next few trajectories. After a "mini-batch" amount of trajectories, average all the gradients across all the runs, and then perform gradient descent to update your parameters. Gradient descent is done identically to what you did in your code using the tf.apply_gradients method. You do this because the environment has a lot of noise; by simulating many trajectories, the idea is that the mean trajectory of the mini-batch is a more probabilistic representation than just one trajectory. I personally use mini-batches of 64.
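A minimal sketch of that gradient-averaging idea, written against the model, sample function and MAX_GRAD_NORM from the question's code (the batch size and the compute_loss helper are my own placeholders; compute_loss stands for the loss computation already inside the question's GradientTape block):

MINI_BATCH = 64  # assumed mini-batch size, as mentioned in the text above

def train_minibatch(env, model, compute_loss):
    """Accumulate gradients over several trajectories, then apply their average once."""
    accum = [tf.zeros_like(w) for w in model.trainable_weights]

    for _ in range(MINI_BATCH):
        obs, act, rews, vals = sample(env, model)
        with tf.GradientTape() as tape:
            loss = compute_loss(model, obs, act, rews, vals)
        grads = tape.gradient(loss, model.trainable_weights)
        accum = [a + g for a, g in zip(accum, grads)]

    # Average over the mini-batch, clip, then take a single optimizer step.
    mean_grads = [a / MINI_BATCH for a in accum]
    mean_grads, _ = tf.clip_by_global_norm(mean_grads, MAX_GRAD_NORM)
    model.optimizer.apply_gradients(zip(mean_grads, model.trainable_weights))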

To enhance exploration of your state space, I would recommend an Ornstein-Uhlenbeck stochastic process. Basically, this is a stable correlated noise system. Because it is correlated noise, it allows you to walk farther away from your initial state than if you used decorrelated noise (i.e., Gaussian noise). If you use decorrelated noise, the long-term average will be 0 because it is zero-mean, unit-variance noise, so essentially you'll end up exactly where you started. A good explanation can be found here: https://www.quora.com/Why-do-we-use-the-Ornstein-Uhlenbeck-Process-in-the-exploration-of-DDPG and Python code can be found at the very bottom of https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py . Just add this noise to your action to improve the exploration.
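A minimal sketch of such an Ornstein-Uhlenbeck noise process (the parameter values are typical defaults I am assuming, not taken from the answer; note it is mainly used with continuous-action methods such as DDPG, where the noise is added directly to the action):

import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        self.x = np.copy(self.mu)

    def __call__(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x

# Usage sketch: noisy_action = deterministic_action + OrnsteinUhlenbeckNoise(size=action_dim)()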

Summary

The sign of your loss function for your policy might be incorrect. Also, online learning for difficult problems is very hard to stabilize. Two easy-to-implement ways to address this are:

  • Replay memory
  • Mini-batch gradient descent, instead of the stochastic gradient descent currently in your code

To add more stability, you can also use a target network. The idea is that in the initial stages your weights get updated very fast; a target network turns the problem into a "non-moving target" problem. The target network's weights are frozen so the target doesn't move, while the "real" network is updated after each episode, and after x iterations you update the target network to match the real network. But this takes longer to implement, so I would suggest the above two first.
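As a rough illustration of that periodic sync step, assuming two networks built from the question's PG class (the sync interval is my own placeholder, and the training details inside the loop are elided):

TARGET_SYNC_EVERY = 100  # assumed interval, in episodes

online_net = PG(env.action_space.n)
target_net = PG(env.action_space.n)

# Run a dummy forward pass so the layer weights exist, then copy them once.
_ = online_net([env.reset()])
_ = target_net([env.reset()])
target_net.set_weights(online_net.get_weights())

for episode in range(1, SAMPLES + 1):
    # ... train online_net as usual, but compute bootstrapped value targets with target_net ...
    if episode % TARGET_SYNC_EVERY == 0:
        # Hard update: the target stays frozen between syncs, then the online weights are copied over.
        target_net.set_weights(online_net.get_weights())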
