Atari score vs reward in RLlib DQN implementation
I'm trying to replicate DQN scores for Breakout using RLlib. After 5M steps the average reward is 2.0, while the known score for Breakout using DQN is 100+. I'm wondering if this is because of reward clipping, and therefore the actual reward does not correspond to the Atari score.
In OpenAI Baselines, the actual score is placed in info['r'], while the reward value is actually the clipped value. Is this the same for RLlib? Is there any way to see the actual average score while training?
According to the list of trainer parameters, the library will clip Atari rewards by default:
# Whether to clip rewards prior to experience postprocessing. Setting to
# None means clip for Atari only.
"clip_rewards": None,
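If you want to compare clipped and unclipped training yourself, the flag can be overridden explicitly. A minimal sketch, assuming the classic RLlib config-dict API where "clip_rewards" is a top-level trainer parameter as in the excerpt above (the env name is just an example):

```python
# Sketch: overriding RLlib's default Atari reward clipping.
# Assumes the classic config-dict API shown in the excerpt above.
config = {
    "env": "BreakoutNoFrameskip-v4",
    # None  -> clip for Atari only (the default shown above)
    # True  -> always sign-clip rewards to -1/0/+1
    # False -> never clip; learn on raw Atari scores
    "clip_rewards": False,
}

# What sign-clipping does to an individual reward, conceptually:
def clip_reward(r):
    """Sign-clip a raw reward to -1, 0, or +1, as DQN-style
    Atari preprocessing traditionally does."""
    return (r > 0) - (r < 0)

print(clip_reward(7.0))  # a 7-point brick becomes +1
print(clip_reward(0.0))  # no reward stays 0
```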
However, the episode_reward_mean reported on TensorBoard should still correspond to the actual, non-clipped scores.
While an average score of 2 is not much at all relative to the benchmarks for Breakout, 5M steps may not be enough for DQN unless you are employing something akin to Rainbow to significantly speed things up. Even then, DQN is notoriously slow to converge, so you may want to check your results using a longer run instead and/or consider upgrading your DQN configuration.
I've thrown together a quick test, and it looks like reward clipping doesn't have much of an effect on Breakout, at least early on in training (unclipped in blue, clipped in orange):
I don't know enough about Breakout's scoring system to comment on it, but if higher rewards become available later on as performance improves (as opposed to collecting the same small reward more frequently, say), we should start seeing the two curves diverge. In such cases, we can still normalize the rewards or convert them to a logarithmic scale.
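To make that divergence concrete: Breakout's higher rows of bricks are worth more points, so sign-clipping flattens exactly the rewards a stronger policy starts to collect, while a transform such as sign(r)·log(1 + |r|) compresses magnitudes but keeps their ordering. A small illustration (the brick values reflect Breakout's scoring, where upper rows are worth more; the episodes themselves are toy examples):

```python
import math

def clip(r):
    """Sign-clipping: every non-zero reward becomes +/-1."""
    return (r > 0) - (r < 0)

def log_scale(r):
    """sign(r) * log(1 + |r|): compresses magnitudes, keeps ordering."""
    return math.copysign(math.log1p(abs(r)), r) if r else 0.0

# Toy episodes: a weak policy only clears cheap bottom-row bricks,
# a stronger one also reaches the higher-valued upper rows.
weak = [1, 1, 1, 1]
strong = [1, 1, 4, 7]

print(sum(weak), sum(strong))                        # raw returns: 4 vs 13
print(sum(map(clip, weak)), sum(map(clip, strong)))  # clipped: 4 vs 4
print(sum(map(log_scale, strong)))                   # log-scaled: strong > weak again
```

Under clipping the two policies look identical, while the raw and log-scaled returns still separate them.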
Here are the configurations I used:
lr: 0.00025
learning_starts: 50000
timesteps_per_iteration: 4
buffer_size: 1000000
train_batch_size: 32
target_network_update_freq: 10000
# (some) rainbow components
n_step: 10
noisy: True
# work-around to remove epsilon-greedy
schedule_max_timesteps: 1
exploration_final_eps: 0
prioritized_replay: True
prioritized_replay_alpha: 0.6
prioritized_replay_beta: 0.4
num_atoms: 51
double_q: False
dueling: False
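For reference, settings like these can also be wired into a tuned-example YAML file and launched with RLlib's CLI (rllib train -f breakout-dqn.yaml). This is only a sketch assuming an older RLlib version where the keys above were valid top-level DQN parameters; the file name, stopping criterion, and env name are my own choices:

```yaml
# breakout-dqn.yaml -- hypothetical file name
breakout-dqn:
    env: BreakoutNoFrameskip-v4
    run: DQN
    stop:
        timesteps_total: 20000000  # DQN typically needs far more than 5M steps
    config:
        # the DQN parameters listed above go here, e.g.:
        lr: 0.00025
        train_batch_size: 32
        target_network_update_freq: 10000
```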
You may be more interested in their rl-experiments, where they posted results from their own library against the standard benchmarks, along with configurations that should get you even better performance.