How do I prevent the reward sum received during evaluation runs from repeating at intervals when using RLlib?

Question

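I am using Ray 1.3.0 (for RLlib) together with SUMO version 1.9.2 to simulate a multi-agent scenario. I have configured RLlib to use a single PPO network that is updated and used in common by all N agents. My evaluation settings look like this: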

# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.

"evaluation_interval": 20,

# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.

"evaluation_num_episodes": 10,

# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.

"evaluation_parallel_to_training": False,

# Internal flag that is set to True for evaluation workers.

"in_evaluation": True,

# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!

"evaluation_config": {
    # Example: overriding env_config, exploration, etc:
    "lr": 0, # To prevent any kind of learning during evaluation
    "explore": True # As required by PPO (read IMPORTANT NOTE above)
},

# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).

"evaluation_num_workers": 1,

# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.

"custom_eval_function": None,

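Every 20 iterations (each iteration collecting "X" training samples), an evaluation run of at least 10 episodes takes place. The rewards received by all N agents are summed over these episodes, and that total is recorded as the reward sum for that particular evaluation run. Over time, I noticed a pattern: the reward sums keep repeating over the same evaluation intervals, and the learning goes nowhere.

Update (23/06/2021)

Unfortunately, I did not have TensorBoard activated for that particular run, but the mean rewards collected during the 10-episode evaluations (which take place every 20 iterations) clearly follow a repeating pattern, as shown in the annotated plot below:

The 20 agents in the scenario should be learning to avoid collisions, but instead they somehow stagnate at a certain policy and end up showing exactly the same reward sequence during evaluation?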
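One way to confirm that the sequence really repeats exactly, rather than just looking similar on a plot, is to search the logged evaluation means for a period. The following is a minimal sketch, not part of RLlib: `find_repeat_period` and the sample values in `eval_rewards` are hypothetical and only illustrate the check.

```python
import math


def find_repeat_period(rewards, tol=1e-9):
    """Return the smallest period p (1 <= p <= len(rewards) // 2) such that
    rewards[i] matches rewards[i + p] for every valid i, or None."""
    n = len(rewards)
    for period in range(1, n // 2 + 1):
        if all(
            math.isclose(rewards[i], rewards[i + period], abs_tol=tol)
            for i in range(n - period)
        ):
            return period
    return None


# Hypothetical mean evaluation rewards, one value per evaluation run.
eval_rewards = [-12.0, -7.5, -9.0, -12.0, -7.5, -9.0, -12.0, -7.5]
print(find_repeat_period(eval_rewards))  # prints 3
```

An exact period in a stochastic multi-agent environment is suspicious in itself: ordinary training noise would normally break it, so it hints at something deterministic such as a restart from the same state or weights that never change.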

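Is this a consequence of how I have configured the evaluation, or should I be checking something else? I would be grateful if anyone could advise me or point me in the right direction.

Thank you.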

Answer 1

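Step 1: I noticed that when I stopped the run at some point for whatever reason and then restarted it from a saved checkpoint, most of the graphs on TensorBoard (including the rewards) traced out the line again in exactly the same fashion, which made the sequence look like it was repeating.

Step 2: This led me to believe that something was wrong with my checkpoints. I compared the weights in the checkpoints using a loop and, lo and behold, they were all the same! Nothing had changed! One possibility was that saving or restoring checkpoints was broken, but after a bit of experimenting I found that this was not the case. That left only one explanation: my weights were not being updated!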
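The weight comparison in Step 2 can be sketched roughly as follows. The answer does not show its loop, so this is an assumption about its shape: `weights_changed` is a hypothetical helper that compares two dictionaries mapping parameter names to (nested) numbers. In RLlib such dictionaries would presumably come from something like `trainer.get_policy(policy_id).get_weights()` after restoring each checkpoint, but that part is left out here and the sample data is made up.

```python
import math


def _flatten(value):
    """Yield the scalar entries of a nested list/tuple or array-like."""
    if hasattr(value, "tolist"):  # e.g. a NumPy array
        value = value.tolist()
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from _flatten(item)
    else:
        yield float(value)


def weights_changed(old, new, tol=0.0):
    """Return the names of parameters that differ between two weight dicts."""
    changed = []
    for name in sorted(set(old) | set(new)):
        if name not in old or name not in new:
            changed.append(name)  # parameter exists in only one dict
            continue
        a, b = list(_flatten(old[name])), list(_flatten(new[name]))
        if len(a) != len(b) or any(
            not math.isclose(x, y, rel_tol=0.0, abs_tol=tol)
            for x, y in zip(a, b)
        ):
            changed.append(name)
    return changed


# Hypothetical weights taken from three checkpoints.
ckpt_a = {"fc1/kernel": [[0.1, 0.2], [0.3, 0.4]], "fc1/bias": [0.0, 0.0]}
ckpt_b = {"fc1/kernel": [[0.1, 0.2], [0.3, 0.4]], "fc1/bias": [0.0, 0.0]}
ckpt_c = {"fc1/kernel": [[0.1, 0.25], [0.3, 0.4]], "fc1/bias": [0.0, 0.0]}

print(weights_changed(ckpt_a, ckpt_b))  # [] -> nothing was updated
print(weights_changed(ckpt_a, ckpt_c))  # ['fc1/kernel']
```

An empty result between checkpoints that are many training iterations apart is the symptom described above: the policy is being saved and restored correctly, but it is never trained.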

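Step 3: I sifted through my training configuration to see whether something was preventing the network from learning, and I noticed that I had set the "policies_to_train" option of my "multiagent" configuration to a policy that did not exist. Unfortunately, this either did not raise a warning or error, or it did and I completely missed it.

Solution step: Once I set the multi-agent "policies_to_train" configuration option correctly, it started to work!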
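Because the misconfiguration passed silently, a small sanity check on the config before building the trainer can catch it early. The sketch below is a hypothetical helper, not an RLlib function. It assumes the usual layout of the "multiagent" section ("policies", "policy_mapping_fn", "policies_to_train") and that leaving "policies_to_train" as None means all policies are trained. The policy names and the placeholder policy spec are made up for illustration.

```python
def unknown_policies_to_train(config):
    """Return IDs listed in policies_to_train that are not defined in policies."""
    multiagent = config.get("multiagent", {})
    policies = multiagent.get("policies", {})
    to_train = multiagent.get("policies_to_train")
    if to_train is None:  # assumed to mean "train every policy"
        return []
    return [pid for pid in to_train if pid not in policies]


# Hypothetical config: one shared policy, but a mistyped ID in policies_to_train.
config = {
    "multiagent": {
        # Placeholder spec; a real one would hold the policy class and spaces.
        "policies": {"shared_ppo": (None, None, None, {})},
        "policy_mapping_fn": lambda agent_id: "shared_ppo",
        "policies_to_train": ["shared_policy"],
    }
}

missing = unknown_policies_to_train(config)
if missing:
    print("policies_to_train refers to undefined policies:", missing)
```

Raising an exception instead of printing would make the check harder to overlook, which matters here since the original problem went unnoticed for exactly that reason.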

Answer 2

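Could it be that, because of the multi-agent dynamics, your policies are chasing their own tails? How many policies do you have? Are they competing, cooperating, or neutral toward each other? Note that multi-agent training can be very unstable, and seeing these fluctuations is quite normal: as the different policies get updated, each one then has to face different "env" dynamics (env = env + all other policies, which appear as part of the env).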


Answer 1
2 · Accepted · 2021-06-24 08:47:31

Answer 2
1 · 2021-06-23 07:03:33
