Poorly initialized target critic
What is a good way to handle the first round of off-policy training with Deep Deterministic Policy Gradients (DDPG)?
Here is my problem: I initialize all weights with Xavier initialization and the biases with zeros. However, when computing the critic loss I get an infinite MSE, because the difference between Q_target and Q_eval is so large. Is it a bad idea to just clip this difference to a very large value?
Q_target_i = r_i + discount * Q_target(i+1)
critic_loss = MSE(Q_target_i, Q_eval_i)
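A minimal NumPy sketch of the two lines above, with an optional clip on the bootstrapped target as the question proposes (the `clip_value` parameter and all numeric values are illustrative assumptions, not part of DDPG itself):

```python
import numpy as np

def critic_targets(rewards, q_next, discount=0.99, clip_value=None):
    """Bootstrapped critic targets: r_i + discount * Q_target(i+1),
    optionally clipped to [-clip_value, clip_value]."""
    targets = rewards + discount * q_next
    if clip_value is not None:
        targets = np.clip(targets, -clip_value, clip_value)
    return targets

rewards = np.array([1.0, 0.5])
q_next = np.array([1e6, 2.0])  # a poorly initialized target critic can emit huge values
targets = critic_targets(rewards, q_next, discount=0.99, clip_value=100.0)

q_eval = np.array([0.9, 1.8])  # hypothetical evaluation-network outputs
critic_loss = np.mean((targets - q_eval) ** 2)  # finite once targets are clipped
```

Clipping keeps the loss finite, but it also biases the targets; the initialization fix below avoids the problem at its source.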
I solved this by initializing the evaluation network to be identical to the target network.
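That fix (starting both networks with identical weights, so Q_target − Q_eval is zero on the first update) can be sketched framework-agnostically; the dict-of-lists parameter container here is a stand-in for real network weights, not a specific library's API:

```python
import copy

# Hypothetical parameter container: layer name -> weights.
eval_params = {"fc1": [0.12, -0.34], "fc2": [0.56]}

# Hard copy at initialization: the target network starts identical to the
# evaluation network, so the first critic-loss computation sees no gap.
target_params = copy.deepcopy(eval_params)
```

In PyTorch the same idea is `target_net.load_state_dict(eval_net.state_dict())`; after this initial copy, the target network is updated by the usual soft (Polyak) updates.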