
Decreasing action sampling frequency for one agent in a multi-agent environment

I'm using rllib for the first time, trying to train a custom multi-agent RL environment, and would like to train a couple of PPO agents on it. The implementation hiccup I need to figure out is how to alter the training for one special agent so that it only takes an action every X timesteps. Is it best to only call compute_action() every X timesteps? Or, on the other steps, to mask the policy selection so that it has to re-sample until a No-Op is chosen? Or to modify both the action that gets fed into the environment and the previous actions in the training batches to be No-Ops?

What's the easiest way to implement this that still takes advantage of rllib's training features? Do I need to create a custom training loop for this, or is there a way to configure PPOTrainer to do it?
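For context, a minimal sketch of the kind of two-policy PPO setup the question describes, assuming the pre-2.0 RLlib API that exposes PPOTrainer (the env name, agent IDs, policy names, and spaces below are placeholders, not taken from the question):

```python
from gym import spaces
from ray.rllib.agents.ppo import PPOTrainer

obs_space = spaces.Box(0.0, 1.0, (4,))  # placeholder spaces
act_space = spaces.Discrete(3)

config = {
    "env": "my_multi_agent_env",  # hypothetical registered custom env
    "multiagent": {
        # One PPO policy per agent; None falls back to PPO's default policy class.
        "policies": {
            "fast_policy": (None, obs_space, act_space, {}),
            "slow_policy": (None, obs_space, act_space, {}),  # the special agent
        },
        "policy_mapping_fn": lambda agent_id, **kwargs: (
            "slow_policy" if agent_id == "special_agent" else "fast_policy"
        ),
    },
}

trainer = PPOTrainer(config=config)
for _ in range(10):
    result = trainer.train()
    print(result["episode_reward_mean"])
```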

Thanks

Let t := the number of timesteps so far. Give the special agent t (mod X) as an observation feature, and don't process its actions in the environment when t (mod X) ≠ 0. As sketched in the code after this list, this accomplishes:

  1. the agent is, in effect, only taking an action every X timesteps, because you are ignoring all the others
  2. the agent can learn that only the actions taken every X timesteps will affect future rewards
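A minimal sketch of this idea inside a custom RLlib MultiAgentEnv (assuming the pre-2.0 ray.rllib.env.multi_agent_env API; the class name, agent IDs, X value, spaces, and reward/done logic are all hypothetical placeholders):

```python
import numpy as np
from gym import spaces
from ray.rllib.env.multi_agent_env import MultiAgentEnv

X = 4  # assumed period: the special agent's actions only count every X steps

class PeriodicAgentEnv(MultiAgentEnv):
    """Hypothetical two-agent env with agents 'special_agent' and 'other_agent'."""

    def __init__(self, config=None):
        self.observation_space = spaces.Box(0.0, float(X), (1,), np.float32)
        self.action_space = spaces.Discrete(3)
        self.t = 0

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action_dict):
        # Drop the special agent's action except when t (mod X) == 0,
        # which is equivalent to forcing a No-Op on the other timesteps.
        if self.t % X != 0:
            action_dict = {aid: a for aid, a in action_dict.items()
                           if aid != "special_agent"}
        # ... apply the remaining actions to the environment state here ...
        self.t += 1
        obs = self._obs()
        rewards = {aid: 0.0 for aid in obs}   # placeholder rewards
        dones = {"__all__": self.t >= 100}    # placeholder episode length
        return obs, rewards, dones, {}

    def _obs(self):
        # Expose t (mod X) so the special agent can learn which of its
        # actions actually take effect.
        phase = np.array([self.t % X], dtype=np.float32)
        return {"special_agent": phase, "other_agent": phase}
```

This keeps the standard trainer loop untouched: from RLlib's point of view the special agent still acts every step, but the environment ignores the off-phase actions, and the t (mod X) feature lets the policy learn that only the on-phase actions influence future rewards.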
