Why does randomizing samples of a reinforcement learning model with a non-linear function approximator reduce variance?
I have read the DQN paper.
While reading it, I found that randomly selecting samples for learning reduces divergence in RL with a non-linear function approximator.
If so, why does RL learning with a non-linear function approximator diverge when the input data are strongly correlated?
I believe that Section X (starting on page 687) of An Analysis of Temporal-Difference Learning with Function Approximation provides an answer to your question. In summary, there exist nonlinear functions whose average prediction error actually increases after applying the TD(0) Bellman operator; hence, the value estimates, and with them the policy, will eventually diverge. This is generally the case for deep neural networks because they are inherently nonlinear and tend to be poorly behaved from an optimization perspective.
In contrast, training on independent and identically distributed (iid) data makes it possible to compute unbiased estimates of the gradient, which first-order optimization algorithms like Stochastic Gradient Descent (SGD) require in order to converge to a local minimum of the loss function. This is why DQN samples random minibatches from a large replay memory and then reduces the loss using RMSProp (an advanced form of SGD).
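The replay-memory idea can be sketched in a few lines. This is a minimal illustration, not DQN's actual implementation: the class name, toy transitions, and capacity are all made up here. The key point is that uniform random sampling from a large buffer breaks the temporal correlation between consecutive transitions, so each minibatch looks closer to iid data:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions, sampled uniformly at random.

    Consecutive transitions from an episode are strongly correlated;
    sampling uniformly from a large buffer decorrelates the minibatch,
    which makes the minibatch gradient a less biased estimate of the
    gradient over the data distribution.
    """

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement within one minibatch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Toy usage: store 100 highly correlated transitions (state t -> t+1),
# then draw a decorrelated minibatch of 32 for a gradient step.
buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buf.sample(32)
```

Training on `batch` rather than on the last 32 consecutive transitions is what lets SGD-style optimizers (RMSProp in DQN's case) behave roughly as the iid convergence analysis assumes.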