
Shaping theorem for MDPs

I need help understanding the shaping theorem for MDPs. Here's the relevant paper: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf It basically says that a Markov decision process with some reward function on transitions between states and actions, R(s, a, s'), has the same optimal policy as a different Markov decision process whose reward is defined as R'(s, a, s') = R(s, a, s') + gamma*f(s') - f(s), where gamma is the time discount rate.

I understand the proof, but it seems like a trivial case where it breaks down is when R(s, a, s') = 0 for all states and actions, and the agent is faced with the path A -> s -> B versus A -> r -> t -> B. With the original Markov process we get an EV of 0 for both paths, so both paths are optimal. But with the potential added to each transition we get gamma^2*f(B) - f(A) for the first path and gamma^3*f(B) - f(A) for the second. So if gamma < 1 and 0 < f(B), f(A), then the second path is no longer optimal.
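
Spelling out that arithmetic: with R = 0 everywhere, the shaped rewards telescope along each path. This sketch assumes, as the question implicitly does, that the return is counted only up to arrival at B:

```latex
\begin{align*}
A \to s \to B:\quad
  & \bigl(\gamma f(s) - f(A)\bigr) + \gamma\bigl(\gamma f(B) - f(s)\bigr)
    = \gamma^{2} f(B) - f(A) \\
A \to r \to t \to B:\quad
  & \bigl(\gamma f(r) - f(A)\bigr) + \gamma\bigl(\gamma f(t) - f(r)\bigr)
    + \gamma^{2}\bigl(\gamma f(B) - f(t)\bigr)
    = \gamma^{3} f(B) - f(A)
\end{align*}
```

The two returns differ by (gamma^2 - gamma^3)*f(B), so the gap depends only on gamma and on the potential at B.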

Am I misunderstanding the theorem, or am I making some other mistake?

You are missing the assumption that for every terminal state s_T and starting state s_0 we have f(s_T) = f(s_0) = 0. (Note that the paper assumes a terminal state is always followed by a new starting state, so the potential "wraps around".)
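
A quick numeric check of this point, as a minimal sketch rather than anything from the paper: the helper `shaped_return`, the potential values, and gamma = 9/10 are all made up for illustration, and A and B are assumed to be the start and terminal states of the example in the question. It compares the two paths with an unconstrained potential and with one pinned to 0 at A and B.

```python
from fractions import Fraction

def shaped_return(path, f, gamma, reward=0):
    """Discounted sum of R + gamma*f(s') - f(s) along a list of states.

    `reward` is the original R, taken to be the same constant on every
    transition (0 in the question's example).
    """
    total = Fraction(0)
    for t, (s, s_next) in enumerate(zip(path, path[1:])):
        total += gamma**t * (reward + gamma * f[s_next] - f[s])
    return total

gamma = Fraction(9, 10)  # exact arithmetic avoids floating-point noise
short_path = ["A", "s", "B"]
long_path = ["A", "r", "t", "B"]

# Arbitrary (hypothetical) potential with f(B) > 0, as in the question.
f_free = {"A": 1, "s": 5, "r": -2, "t": 7, "B": 3}
# Same potential, but pinned to 0 at the start state A and terminal state B.
f_pinned = {**f_free, "A": 0, "B": 0}

for name, f in [("free", f_free), ("pinned", f_pinned)]:
    print(name,
          shaped_return(short_path, f, gamma),
          shaped_return(long_path, f, gamma))
```

With the unconstrained potential the two returns differ, which is the effect described in the question. With f(A) = f(B) = 0 both returns are exactly 0, whatever values the intermediate states take, so neither path is preferred.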


 