
Shaping theorem for MDPs

Original question posted 2022-01-20 19:11:18. Tags: reinforcement-learning, markov-decision-process

I need help understanding the shaping theorem for MDPs. Here's the relevant paper: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf. It basically says that a Markov decision process with some reward function R(s, a, s') on transitions between states and actions has the same optimal policy as a different Markov decision process whose reward is defined as R'(s, a, s') = R(s, a, s') + gamma*f(s') - f(s), where gamma is the time-discount rate.

I understand the proof, but it seems to break down in a trivial case: R(s, a, s') = 0 for all states and actions, and the agent has to choose between the path A -> s -> B and the path A -> r -> t -> B. In the original Markov process both paths have an expected value of 0, so both are optimal. But with the potential added to each transition, we get gamma^2*f(B) - f(A) for the first path and gamma^3*f(B) - f(A) for the second. So if gamma < 1 and f(A), f(B) > 0, the second path is no longer optimal.

Am I misunderstanding the theorem, or am I making some other mistake?

1 solution

Solution 1 (2022-01-21 20:43:10)

You are missing the assumption that for every terminal state s_T and every starting state s_0 we have f(s_T) = f(s_0) = 0. (Note that the paper assumes a terminal state is always followed by a new starting state, so the potential "wraps around".)
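To make this concrete, here is a minimal sketch that evaluates the two paths from the question under the shaped reward. The potential values, gamma = 0.9, and the helper name `shaped_return` are made up for illustration and do not come from the paper. It leaves f(A) positive, as in the question, because f(A) shifts both paths by the same amount; what separates the paths is whether f(B) is zero.

```python
def shaped_return(path, f, gamma, reward=0.0):
    """Discounted return along `path` under R' = R + gamma*f(s') - f(s).

    `reward` is the original R(s, a, s'), constant here (0 in the question).
    """
    total = 0.0
    for t, (s, s_next) in enumerate(zip(path, path[1:])):
        shaped = reward + gamma * f[s_next] - f[s]
        total += gamma ** t * shaped  # the f terms telescope across steps
    return total


gamma = 0.9  # illustrative discount rate
short_path = ["A", "s", "B"]
long_path = ["A", "r", "t", "B"]

# Hypothetical potential as in the question: f(B) > 0 at the terminal state.
f_question = {"A": 1.0, "s": 0.5, "r": 0.3, "t": 0.7, "B": 2.0}
# Same potential, but with f(B) = 0 at the terminal state.
f_zero_terminal = dict(f_question, B=0.0)

for label, f in [("f(B) = 2", f_question), ("f(B) = 0", f_zero_terminal)]:
    short = shaped_return(short_path, f, gamma)
    long_ = shaped_return(long_path, f, gamma)
    print(f"{label}: short path {short:.3f}, long path {long_:.3f}")
```

With f(B) > 0 the two returns are gamma^2*f(B) - f(A) and gamma^3*f(B) - f(A), which differ as the question observes. With f(B) = 0 both collapse to -f(A), so neither path is preferred and the original set of optimal policies is preserved.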


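Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact yoyou2525@163.com.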
