Poorly initialized target critic

What is a good way to handle the first round of off-policy training with Deep Deterministic Policy Gradients (DDPG)?

Here is my problem: I initialize all weights with Xavier initialization and the biases with zeros. However, when computing the critic loss I get an infinite MSE, because the difference between Q_target and Q_eval is so large. Is it a bad idea to just clip this to a very large value?

Q_target_i = r_i + discount * Q_target(s_{i+1}, a_{i+1})
critic_loss = MSE(Q_target_i, Q_eval_i)
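
For reference, here is a minimal PyTorch sketch of that critic update. The names critic, target_critic, target_actor and the batch tensors are hypothetical, not from the question; the target action a_{i+1} is taken from the target actor, as in standard DDPG:

    import torch
    import torch.nn.functional as F

    def ddpg_critic_loss(critic, target_critic, target_actor,
                         states, actions, rewards, next_states, dones,
                         discount=0.99):
        # Bellman target: r_i + discount * Q_target(s_{i+1}, a_{i+1}),
        # computed without gradients so only the eval critic is updated.
        with torch.no_grad():
            next_actions = target_actor(next_states)
            q_next = target_critic(next_states, next_actions)
            q_target = rewards + discount * (1.0 - dones) * q_next
        q_eval = critic(states, actions)
        return F.mse_loss(q_eval, q_target)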

I solved this by initializing the evaluation network to be identical to the target network.
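
A minimal sketch of that initialization, assuming hypothetical PyTorch modules named critic/target_critic and actor/target_actor; copying in either direction gives the same identical start, so Q_target and Q_eval agree and the first TD errors stay small:

    import copy

    # Make the target networks exact copies of the eval networks at startup.
    target_critic = copy.deepcopy(critic)
    target_actor = copy.deepcopy(actor)

    # Equivalent via state dicts:
    target_critic.load_state_dict(critic.state_dict())
    target_actor.load_state_dict(actor.state_dict())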
