
Understanding tensorboard plots for PPO in RLLIB

I am a beginner in deep RL and would like to train my own Gym environment in RLLIB with the PPO algorithm. However, I am having some difficulty telling whether my hyperparameter settings are successful. Apart from the obvious episode_reward_mean metric, which should rise, there are many other plots.

I am especially interested in how entropy should evolve during a successful training run. In my case it looks like this:

[entropy.jpg: policy entropy plotted over the course of training]

It usually drops below 0 and then converges. I understand that entropy, as part of the loss function, encourages exploration and can therefore speed up learning. But why does it become negative? Shouldn't it always be greater than or equal to 0?

What are other characteristics of a successful training run (vf_explained_var, vf_loss, kl, ...)?

If your action space is continuous, entropy can be negative, because differential entropy can be negative.
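For intuition: the differential entropy of a univariate Gaussian is 0.5·ln(2πeσ²), which becomes negative once σ is small enough, i.e. once the policy's action distribution gets narrow. A quick sketch in plain Python (this is only for illustration; RLlib reports the entropy computed by the policy's own action distribution):

```python
import math

def gaussian_entropy(sigma: float) -> float:
    """Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

print(gaussian_entropy(1.0))   # ~ 1.42 -> positive while the policy is still wide
print(gaussian_entropy(0.1))   # ~ -0.88 -> negative once the policy becomes narrow
```

So an entropy curve that starts positive and sinks below 0 as the policy sharpens is perfectly normal for continuous actions.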

Ideally, you want the entropy to decrease slowly and smoothly over the course of training, as the agent trades exploration for exploitation.
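If the entropy collapses too fast (or never drops at all), the weight of the entropy bonus is the usual knob to adjust. A minimal sketch, assuming a recent RLlib version with the PPOConfig builder API; the environment name and coefficient values below are placeholders, not recommendations:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")   # replace with your own Gym environment
    .training(
        entropy_coeff=0.01,       # weight of the entropy bonus; larger -> more exploration pressure
        kl_coeff=0.2,             # initial weight of the KL penalty term
    )
)

algo = config.build()
result = algo.train()             # one training iteration; the metrics discussed here end up in TensorBoard
```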

With regard to the vf_* metrics, it's helpful to know what they mean.

In policy gradient methods, it can be helpful to reduce the variance of rollout estimates by using a value function (parameterized by a neural network) to estimate rewards that are farther in the future (see the PPO paper for some math on page 5).
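Concretely, the value network is trained to predict the discounted return from each state, and vf_loss is roughly the squared error of that prediction. A toy sketch with NumPy, using plain discounted returns as the regression target (RLlib actually uses GAE-based targets and a clipped value loss, so this only shows the idea):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Value targets: R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.array([1.0, 1.0, 1.0, 0.0])
targets = discounted_returns(rewards)        # what the value net should learn to predict
v_pred  = np.array([2.8, 1.9, 1.0, 0.1])     # pretend outputs of the value net
vf_loss = np.mean((v_pred - targets) ** 2)   # mean squared error, analogous to vf_loss
print(targets, vf_loss)
```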

vf_explained_var is the explained variance of those future rewards through the use of the value function. You want this to be as high as possible, and it tops out at 1; however, if there is randomness in your environment it is unlikely to actually hit 1. vf_loss is the error that your value function is incurring; ideally this would decrease to 0, though this isn't always possible (again, due to randomness). kl is the difference between your old policy and your new policy at each time step: you want this to decrease smoothly as you train, indicating convergence.
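For reference, explained variance is usually computed as 1 − Var(target − prediction) / Var(target); a short sketch (the exact bookkeeping inside RLlib may differ slightly):

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 = perfect fit, 0 = no better than predicting the mean, negative = worse than the mean."""
    var_y = np.var(y_true)
    return float("nan") if var_y == 0 else 1.0 - np.var(y_true - y_pred) / var_y

targets = np.array([2.97, 1.99, 1.0, 0.0])   # value targets from the rollout
v_pred  = np.array([2.80, 1.90, 1.0, 0.1])   # value-net predictions
print(explained_variance(targets, v_pred))   # close to 1 -> the value net explains most of the variance
```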
