
Are there benefits to having Actor and Critic use significantly different models?

In Actor-Critic methods the Actor and Critic are assigned two complementary, but different goals. I'm trying to understand whether the differences between these goals (updating a policy and updating a value function) are large enough to warrant different models for the Actor and Critic, or if they are of similar enough complexity that the same model should be reused for simplicity. I realize that this could be very situational, but I am not sure in what way. For example, does the balance shift as the model complexity grows?

Please let me know if there are any rules of thumb for this, or if you know of a specific publication that addresses the issue.

The empirical results suggest the exact opposite: it is important to have the same network doing both (up to some final layer/head). The main reason for this is that learning the value network (critic) provides a signal for shaping the representation of the policy (actor) that would otherwise be nearly impossible to get.
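As a concrete illustration, here is a minimal sketch (in PyTorch; the class name, layer sizes, and dimensions are all made up for this example, not taken from the answer) of the "same network up to a final layer/head" architecture described above:

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    # Actor and critic share one trunk; only the final heads differ,
    # so gradients from the value loss also shape the policy's representation.
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: state value V(s)

    def forward(self, obs):
        features = self.trunk(obs)  # shared representation used by both heads
        return self.policy_head(features), self.value_head(features)

# Example forward pass with hypothetical dimensions:
net = SharedActorCritic(obs_dim=4, n_actions=2)
logits, value = net(torch.randn(1, 4))
```

In training, the policy loss and the value loss are typically summed (often with a weighting coefficient on the value term) and backpropagated through the shared trunk together.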

In fact, if you think about it, these are extremely similar goals, since for the optimal deterministic policy

pi(s) = arg max_a Q(s, a) = arg max_a V(T(s, a))

where T is the transition dynamics.
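To make the coupling concrete, here is a toy sketch (the Q-values are hypothetical numbers, not from the answer) of how the optimal deterministic policy falls directly out of the value estimates:

```python
import torch

# Hypothetical Q(s, a) for a single state with three actions. Once these
# values are learned, the greedy policy is just an argmax over actions,
# which is why value learning and policy learning are such similar goals.
q_values = torch.tensor([0.1, 0.7, 0.2])
greedy_action = q_values.argmax().item()  # pi(s) = arg max_a Q(s, a)
print(greedy_action)  # -> 1
```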
