
Atari score vs reward in rllib DQN implementation

I'm trying to replicate DQN scores for Breakout using RLLib. After 5M steps the average reward is 2.0, while the known score for Breakout using DQN is 100+. I'm wondering if this is because of reward clipping, and therefore the actual reward does not correspond to the Atari score. In the OpenAI baselines, the actual score is placed in info['r'] while the reward value is actually the clipped value. Is this the same case for RLLib? Is there any way to see the actual average score while training?

According to the list of trainer parameters, the library will clip Atari rewards by default:

# Whether to clip rewards prior to experience postprocessing. Setting to
# None means clip for Atari only.
"clip_rewards": None,

However, the episode_reward_mean reported on TensorBoard should still correspond to the actual, non-clipped scores.
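
To check this, here is a minimal sketch of reading episode_reward_mean directly from a training iteration. It assumes an older RLlib release where DQNTrainer lives under ray.rllib.agents.dqn (newer versions moved to a DQNConfig/Algorithm API), and BreakoutNoFrameskip-v4 is just an illustrative environment name:

import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

# clip_rewards=None keeps the default behaviour: rewards are clipped for
# Atari during learning, but episode stats are computed from raw rewards.
trainer = DQNTrainer(
    env="BreakoutNoFrameskip-v4",
    config={"clip_rewards": None},
)

# train() runs one training iteration and returns a result dict; the
# episode_reward_mean entry reflects the actual (unclipped) Atari score.
result = trainer.train()
print(result["episode_reward_mean"])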


While an average score of 2 is not much at all relative to the benchmarks for Breakout, 5M steps may not be enough for DQN unless you are employing something akin to Rainbow to significantly speed things up. Even then, DQN is notoriously slow to converge, so you may want to check your results using a longer run and/or consider upgrading your DQN configuration.

I've thrown together a quick test and it looks like the reward clipping doesn't have much of an effect on Breakout, at least early on in the training (unclipped in blue, clipped in orange):

[figure: episode_reward_mean over training steps, unclipped run in blue, clipped run in orange]
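
If you want to reproduce this kind of comparison, a minimal sketch using Tune's grid search over clip_rewards could look like the following (again assuming the older ray.tune/RLlib API; the stopping criterion is arbitrary):

import ray
from ray import tune

ray.init()

# Launch the same DQN setup twice, once with clipping and once without,
# so the two episode_reward_mean curves can be compared on TensorBoard.
tune.run(
    "DQN",
    stop={"timesteps_total": 5000000},
    config={
        "env": "BreakoutNoFrameskip-v4",
        "clip_rewards": tune.grid_search([True, False]),
    },
)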

I don't know enough about Breakout to comment on its scoring system, but if higher rewards become available later on as performance improves (as opposed to getting the same small reward more frequently, say), we should start seeing the two curves diverge. In such cases, we can still normalize the rewards or convert them to a logarithmic scale, as sketched below.
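
For illustration only, one way to put rewards on a logarithmic scale is a small gym reward wrapper; this is a hypothetical helper, not something RLlib applies for you:

import gym
import numpy as np

class LogScaleReward(gym.RewardWrapper):
    # Squash each reward to sign(r) * log(1 + |r|) so rare large rewards
    # do not dominate learning while small rewards keep their sign.
    def reward(self, reward):
        return float(np.sign(reward) * np.log1p(abs(reward)))

# Usage: env = LogScaleReward(gym.make("BreakoutNoFrameskip-v4"))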

Here are the configurations I used (a sketch of how to wrap them into a runnable experiment file follows the list):

lr: 0.00025
learning_starts: 50000
timesteps_per_iteration: 4
buffer_size: 1000000
train_batch_size: 32
target_network_update_freq: 10000
# (some) rainbow components
n_step: 10
noisy: True
# work-around to remove epsilon-greedy
schedule_max_timesteps: 1
exploration_final_eps: 0
prioritized_replay: True
prioritized_replay_alpha: 0.6
prioritized_replay_beta: 0.4
num_atoms: 51
double_q: False
dueling: False
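
As a rough sketch, these settings could be wrapped into a Tune experiment file in the style of RLlib's tuned examples and launched with rllib train -f breakout-dqn.yaml; the experiment name, environment, and stopping criterion below are illustrative, not part of my original run:

breakout-dqn:
  env: BreakoutNoFrameskip-v4
  run: DQN
  stop:
    timesteps_total: 10000000
  config:
    lr: 0.00025
    learning_starts: 50000
    buffer_size: 1000000
    train_batch_size: 32
    target_network_update_freq: 10000
    n_step: 10
    noisy: True
    # ... plus the remaining settings from the list above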

You may be more interested in their rl-experiments repository, where they posted results from their own library against the standard benchmarks, along with the configurations, which should let you get even better performance.
