Modelling action use limit in Markov Decision Process

Posted 2021-03-22 05:19:25 · Tags: reinforcement-learning, markov-chains, state-diagram, markov-decision-process

I have a Markov Decision Process with a certain number of states and actions. I want to incorporate into my model an action that can be used only once, from any of the states, and that cannot be used again after it has been used. How do I model this action in my state diagram? I thought of having a separate state and of using -inf for rewards, but neither of these seems to work out. Thanks!

1 Answer

To satisfy the Markov property, each state has to include the information about whether this action has already been used; there is no other way around it. This makes your state space larger, but your state diagram will then work out as you expect.
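As a rough illustration of the idea, here is a minimal sketch that augments a tabular MDP with a single "used" flag. It assumes a hypothetical format in which `P[s][a]` is a list of `(probability, next_state, reward)` tuples; the function name `augment_once_only` and the action name `"boost"` are made up for the example. Once the flag is set, the once-only action is simply not offered, so no -inf reward is needed.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# Hypothetical tabular format: P[s][a] -> list of (probability, next_state, reward)
Transitions = Dict[State, Dict[Action, List[Tuple[float, State, float]]]]


def augment_once_only(P: Transitions, special: Action) -> Transitions:
    """Build an MDP over augmented states (s, used), where `used` records
    whether `special` has already been taken somewhere in the episode."""
    P_aug: Transitions = {}
    for s, actions in P.items():
        for used in (False, True):
            P_aug[(s, used)] = {}
            for a, outcomes in actions.items():
                if used and a == special:
                    continue  # the once-only action is no longer available
                # The flag becomes True as soon as `special` is taken, and stays True.
                P_aug[(s, used)][a] = [
                    (prob, (s_next, used or a == special), r)
                    for prob, s_next, r in outcomes
                ]
    return P_aug


# Small made-up example: "boost" may be used only once.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "boost": [(1.0, 1, 5.0)]},
    1: {"stay": [(1.0, 1, 1.0)], "boost": [(1.0, 0, 5.0)]},
}
P_aug = augment_once_only(P, "boost")
print(sorted(P_aug[(0, True)]))    # ['stay']
print(P_aug[(0, False)]["boost"])  # [(1.0, (1, True), 5.0)]
```

The augmented model is again an ordinary MDP, so any standard solver (value iteration, Q-learning, etc.) can be run on it unchanged.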

Assume that you have three states, S = {1,2,3}, and two actions, A = {1,2}, where each of the actions can be used only once. Your states then become S = {(1,p1,p2), (2,p1,p2), (3,p1,p2)}, where p1 is a boolean that tells whether action 1 has previously been used and p2 is a boolean that tells whether action 2 has previously been used. Action 1 can only be chosen while p1 = 0, and choosing it sets p1 to 1 in the successor state (likewise for action 2 and p2). This means that in total you will now have 3 × 2 × 2 = 12 states: S = {(1,0,0), (1,1,0), (1,0,1), (1,1,1), (2,0,0), (2,1,0), (2,0,1), (2,1,1), (3,0,0), (3,1,0), (3,0,1), (3,1,1)}
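The construction in this example can be enumerated mechanically. The following is a minimal sketch; the helper names `available_actions` and `update_flags` are hypothetical, and the base transition to `s_next` is assumed to come from your original model.

```python
from itertools import product

BASE_STATES = [1, 2, 3]
ACTIONS = [1, 2]

# Augmented state: (s, p1, p2); p_i == 1 means action i has already been used.
STATES = [(s, p1, p2) for s, p1, p2 in product(BASE_STATES, (0, 1), (0, 1))]


def available_actions(state):
    """Return the actions whose usage flag is still 0 in this augmented state."""
    _, *flags = state
    return [a for a, used in zip(ACTIONS, flags) if not used]


def update_flags(state, action, s_next):
    """Return the successor augmented state: the base state moves to s_next
    (taken from your original transition model) and the flag of `action` is set to 1."""
    _, *flags = state
    flags[ACTIONS.index(action)] = 1
    return (s_next, *flags)


print(len(STATES))                    # 12
print(available_actions((2, 1, 0)))   # [2]
print(update_flags((2, 1, 0), 2, 3))  # (3, 1, 1)
```

In this toy example both actions are once-only, so a state with p1 = p2 = 1 has no actions left; any unrestricted actions in your real model would simply always be included.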