Problem with coding a Markov Decision Process

Question

I am trying to code a Markov Decision Process (MDP) and have run into a problem. Could you please check my code and find out why it doesn't work?

I tried it on a small data set, and it works and gives the results I need, which I believe are correct. My problem is generalising this code. Yes, I know about the MDP library, but I need to code this myself. This code works, and I want the same result from a class:

import pandas as pd
data = [['3 0', 'UP', 0.6, '3 1', 5, 'YES'], ['3 0', 'UP', 0.4, '3 2', -10, 'YES'], \
    ['3 0', 'RIGHT', 1, '3 3', 10, 'YES'], ['3 1', 'RIGHT', 1, '3 3', 4, 'NO'], \
    ['3 2', 'DOWN', 0.6, '3 3', 3, 'NO'], ['3 2', 'DOWN', 0.4, '3 1', 5, 'NO'], \
    ['3 3', 'RIGHT', 1, 'EXIT', 7, 'NO'], ['EXIT', 'NO', 1, 'EXIT', 0, 'NO']]

df = pd.DataFrame(data, columns = ['Start', 'Action', 'Probability', 'End', 'Reward', 'Policy']) #initial matrix

point_3_0, point_3_1, point_3_2, point_3_3, point_EXIT = 0, 0, 0, 0, 0

gamma = 0.9 #it is a discount factor

for i in range(100): 
    point_3_0 = gamma * max(0.6 * (point_3_1 + 5) + 0.4 * (point_3_2 - 10), point_3_3 + 10)
    point_3_1 = gamma * (point_3_3 + 4)
    point_3_2 = gamma * (0.6 * (point_3_3 + 3) + 0.4 * (point_3_1 + 5))
    point_3_3 = gamma * (point_EXIT + 7)


print(point_3_0, point_3_1, point_3_2, point_3_3, point_EXIT)

But here I have a mistake somewhere, and it looks too complex. Could you please help me with this issue?

gamma = 0.9

class MDP:

    def __init__(self, gamma, table):
        self.gamma = gamma
        self.table = table

    def Action(self, state):
        return self.table[self.table.Start == state].Action.values

    def Probability(self, state):
        return self.table[self.table.Start == state].Probability.values

    def End(self, state):
        return self.table[self.table.Start == state].End.values

    def Reward(self, state):
        return self.table[self.table.Start == state].Reward.values

    def Policy(self, state):
        return self.table[self.table.Start == state].Policy.values

mdp = MDP(gamma = gamma, table = df)

def value_iteration():
    states = mdp.table.Start.values
    actions = mdp.Action
    probabilities = mdp.Probability
    ends = mdp.End
    rewards = mdp.Reward
    policies = mdp.Policy

    V1 = {s: 0 for s in states}
    for i in range(100):
        V = V1.copy()
        for s in states:
            if policies(s) == 'YES':
                V1[s] = gamma * max(rewards(s) + [sum([p * V[s1] for (p, s1) \
                in zip(probabilities(s), ends(s))][actions(s)==a]) for a in set(actions(s))])
            else: 
                sum(probabilities[s] * ends(s))

    return V

value_iteration()

I expect a value for every point, but I get: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Answer 1

You get the error because, for the state '3 0', policies(s) = ['YES' 'YES' 'YES'], so it contains 'YES' three times. If you want to check whether all elements in policies(s) are 'YES', simply replace policies(s) == 'YES' with all(x=='YES' for x in policies(s))

If you only want to check the first element, change it to policies(s)[0] == 'YES'

See the post "Check if all elements in a list are identical" for different approaches.
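To make the cause concrete, here is a minimal sketch that reproduces the error outside the MDP class and shows the alternatives mentioned above. The array contents mirror what policies('3 0') returns for the question's table; the variable names are only illustrative.

```python
import numpy as np

# Same shape of data that policies('3 0') returns in the question
policies = np.array(["YES", "YES", "YES"])

# Comparing the array to a string yields one boolean per element,
# so using the result directly in `if` is ambiguous and raises.
raised = False
try:
    if policies == "YES":
        pass
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Explicit reductions state what is actually meant
all_yes = all(x == "YES" for x in policies)   # every row is 'YES'
any_yes = (policies == "YES").any()           # at least one row is 'YES'
first_yes = policies[0] == "YES"              # only look at the first row

print(all_yes, any_yes, first_yes)
```

Which of the three checks is right depends on what the Policy column is meant to encode, and the question does not say, so treat the choice as an assumption.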

Answer 2

For the second problem described (assuming (policies(s) == 'YES').any() fixed the first problem), notice that you initialize a regular Python list with this expression:

[p * V[s1] for (p, s1) in zip(probabilities(s), ends(s))]

which you then try to index with [actions(s)==a]. Python lists don't support multiple indexing (here, indexing with a boolean array), and this causes the TypeError you encountered.
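One way around the list-indexing problem is to build a NumPy array instead of a list, since NumPy arrays do accept a boolean mask such as actions == a. The sketch below applies that idea to the question's table. It is only an illustration under some assumptions: the update rule is taken from the hand-coded loop in the question (gamma times the best expected reward plus next-state value), the Policy column is ignored because its intended meaning is not clear from the question, and the function signature (table, gamma, n_iter) is made up for this example.

```python
import numpy as np
import pandas as pd

data = [['3 0', 'UP', 0.6, '3 1', 5, 'YES'], ['3 0', 'UP', 0.4, '3 2', -10, 'YES'],
        ['3 0', 'RIGHT', 1, '3 3', 10, 'YES'], ['3 1', 'RIGHT', 1, '3 3', 4, 'NO'],
        ['3 2', 'DOWN', 0.6, '3 3', 3, 'NO'], ['3 2', 'DOWN', 0.4, '3 1', 5, 'NO'],
        ['3 3', 'RIGHT', 1, 'EXIT', 7, 'NO'], ['EXIT', 'NO', 1, 'EXIT', 0, 'NO']]
df = pd.DataFrame(data, columns=['Start', 'Action', 'Probability',
                                 'End', 'Reward', 'Policy'])

def value_iteration(table, gamma=0.9, n_iter=100):
    """Hypothetical helper: same update rule as the hand-coded loop."""
    states = table['Start'].unique()
    V = {s: 0.0 for s in states}
    for _ in range(n_iter):
        V_old = V.copy()  # read old values, write new ones
        for s in states:
            rows = table[table['Start'] == s]
            actions = rows['Action'].to_numpy()
            # One term per table row: p * (reward + value of the end state).
            # Wrapping in np.array is what makes boolean masking possible.
            terms = np.array([p * (r + V_old[e]) for p, r, e in
                              zip(rows['Probability'], rows['Reward'], rows['End'])])
            # Sum the terms belonging to each action, keep the best action
            V[s] = float(gamma * max(terms[actions == a].sum()
                                     for a in set(actions)))
    return V

values = value_iteration(df)
print(values)
```

For this table the result should match the hand-coded version, roughly 14.67, 9.27, 10.16, 6.3 and 0 for '3 0', '3 1', '3 2', '3 3' and 'EXIT'. The accessor methods of the MDP class are not needed here, because filtering the rows once per state gives all columns at the same time.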

2 answers

Answer 1
0 2019-06-23 18:41:14

Answer 2
0 2019-06-23 19:37:34
