
How to prevent my reward sum received during evaluation runs repeating in intervals when using RLlib?

I am using Ray 1.3.0 (for RLlib) in combination with SUMO version 1.9.2 for the simulation of a multi-agent scenario. I have configured RLlib to use a single PPO network that is shared and updated by all N agents. My evaluation settings look like this:

# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.

"evaluation_interval": 20,

# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.

"evaluation_num_episodes": 10,

# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.

"evaluation_parallel_to_training": False,

# Internal flag that is set to True for evaluation workers.

"in_evaluation": True,

# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!

"evaluation_config": {
    # Example: overriding env_config, exploration, etc:
    "lr": 0, # To prevent any kind of learning during evaluation
    "explore": True # As required by PPO (read IMPORTANT NOTE above)
},

# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).

"evaluation_num_workers": 1,

# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.

"custom_eval_function": None,

What happens is that every 20 iterations (each iteration collecting "X" training samples), there is an evaluation run of at least 10 episodes. The reward received by all N agents is summed over these episodes, and that sum is reported as the reward for that particular evaluation run. Over time, I notice that the reward sums follow a pattern that keeps repeating over the same interval of evaluation runs, and the learning goes nowhere.
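
(For reference, a minimal sketch of how the evaluation reward described above can be read back from the result dict returned by Trainer.train(), assuming a trainer built as sketched earlier; the iteration count is illustrative only:)

for i in range(200):  # illustrative number of training iterations
    result = trainer.train()
    # Evaluation stats are reported under the "evaluation" key on iterations
    # where an evaluation run happened (every `evaluation_interval` iterations).
    if "evaluation" in result:
        print(i, result["evaluation"]["episode_reward_mean"])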

UPDATE (23/06/2021)

Unfortunately, I did not have TensorBoard activated for that particular run, but from the mean rewards collected during the evaluations (which happen every 20 iterations) of 10 episodes each, it is clear that there is a repeating pattern, as shown in the annotated plot below:

[Annotated plot: mean reward vs. number of iterations]

The 20 agents in the scenario should be learning to avoid colliding, but instead they somehow stagnate at a certain policy and end up showing the exact same reward sequence during evaluation.

Is this a characteristic of how I have configured the evaluation, or should I be checking something else? I would be grateful if anyone could advise or point me in the right direction.

Thank you.

Step 1: I noticed that when I stopped the run at some point for any reason and then restarted it from the saved checkpoint after restoration, most graphs on TensorBoard (including rewards) charted out the line in EXACTLY the same fashion all over again, which made it look like the sequence was repeating.
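
(The stop/restart described here is the usual Trainer.save() / Trainer.restore() round trip; a minimal sketch, assuming a trainer rebuilt with the same config and the hypothetical env id used above:)

# During training: periodically write a checkpoint.
checkpoint_path = trainer.save()

# After stopping: rebuild an identically configured trainer and restore it.
restored_trainer = PPOTrainer(config=config, env="sumo_multi_agent_env")
restored_trainer.restore(checkpoint_path)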

Step 2: This led me to believe that there was something wrong with my checkpoints. I compared the weights in the checkpoints using a loop and, voilà, they are all the same! Not a single change! So either there was something wrong with the saving/restoring of checkpoints (which, after a bit of playing around, I found was not the case), or my weights were simply not being updated!
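
(A rough sketch of the kind of loop-based weight comparison meant here, assuming two trainers restored from different checkpoints, the hypothetical policy id "shared_ppo", and that get_weights() returns a dict mapping parameter names to numpy arrays:)

import numpy as np

def policies_identical(trainer_a, trainer_b, policy_id="shared_ppo"):
    # Compare every weight tensor of the given policy across the two trainers.
    w_a = trainer_a.get_policy(policy_id).get_weights()
    w_b = trainer_b.get_policy(policy_id).get_weights()
    return all(np.allclose(w_a[name], w_b[name]) for name in w_a)

# If this stays True for checkpoints taken many iterations apart,
# the policy weights are not being updated at all.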

Step 3: I sifted through my training configuration to see if something there was preventing the network from learning, and I noticed I had set my "multiagent" configuration option "policies_to_train" to a policy that did not exist. Unfortunately, this either did not throw a warning/error, or it did and I completely missed it.

Solution step: By setting the multiagent "policies_to_train" configuration option correctly, it started to work!
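
(A simple guard that would have caught this kind of mistake; it only assumes the standard layout of the "multiagent" config dict:)

multiagent_cfg = config["multiagent"]
unknown = [pid for pid in (multiagent_cfg.get("policies_to_train") or [])
           if pid not in multiagent_cfg["policies"]]
assert not unknown, f"policies_to_train references unknown policy ids: {unknown}"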

Could it be that, due to the multi-agent dynamics, your policy is chasing its tail? How many policies do you have? Are they competing/collaborating/neutral to each other? Note that multi-agent training can be very unstable, and seeing these fluctuations is quite normal, as the different policies get updated and then have to face different "env" dynamics because of that (env = env + all other policies, which appear as part of the env as well).
