
What's wrong with Dyna-Q? (Dyna-Q vs Q-learning)

I implemented the Q-learning algorithm and used it on FrozenLake-v0 from OpenAI Gym. Over 10000 episodes I get a total reward of about 185 during training and about 7333 during testing. Is this good?

I also tried the Dyna-Q algorithm, but it performs worse than Q-learning: roughly 200 total reward during training and 700-900 during testing, over 10000 episodes with 50 planning steps.

Why is this happening?

Below is the code. Is something wrong with it?

import gym
import numpy as np

# Setup
env = gym.make('FrozenLake-v0')

epsilon = 0.9
lr_rate = 0.1
gamma = 0.99
planning_steps = 0  # 0 for plain Q-learning; 50 was used for the Dyna-Q runs above

total_episodes = 10000
max_steps = 100

Training and testing loop:

while t < max_steps:
    action = agent.choose_action(state)  
    state2, reward, done, info = agent.env.step(action)  
    # Removed in testing
    agent.learn(state, state2, reward, action)
    agent.model.add(state, action, state2, reward)
    agent.planning(planning_steps)
    # Till here
    state = state2

def add(self, state, action, state2, reward):
    self.transitions[state, action] = state2
    self.rewards[state, action] = reward

def sample(self, env):
    state, action = 0, 0
    # Random visited state
    if all(np.sum(self.transitions, axis=1)) <= 0:
        state = np.random.randint(env.observation_space.n)
    else:
        state = np.random.choice(np.where(np.sum(self.transitions, axis=1) > 0)[0])

    # Random action in that state
    if all(self.transitions[state]) <= 0:
        action = np.random.randint(env.action_space.n)
    else:    
        action = np.random.choice(np.where(self.transitions[state] > 0)[0])
    return state, action

def step(self, state, action):
    state2 = self.transitions[state, action]
    reward = self.rewards[state, action]
    return state2, reward

def choose_action(self, state):
    if np.random.uniform(0, 1) < epsilon:
        return self.env.action_space.sample()
    else:
        return np.argmax(self.Q[state, :])

def learn(self, state, state2, reward, action):
    # predict = Q[state, action]
    # Q[state, action] = Q[state, action] + lr_rate * (target - predict)
    target = reward + gamma * np.max(self.Q[state2, :])
    self.Q[state, action] = (1 - lr_rate) * self.Q[state, action] + lr_rate * target

def planning(self, n_steps):
    # if len(self.transitions)>planning_steps:
    for i in range(n_steps):
        state, action =  self.model.sample(self.env)
        state2, reward = self.model.step(state, action)
        self.learn(state, state2, reward, action)

I guess it could be because the environment is stochastic. Learning the model in a stochastic environment may lead to sub-optimal policies. In Sutton & Barto's RL book they say that they assume a deterministic environment.
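
One way to deal with the stochasticity, instead of overwriting a single deterministic outcome per (state, action), is to have the model count every observed outcome and sample next states and rewards in proportion to those counts during planning. A minimal sketch of such a model, assuming a tabular setting; the class and method names below are illustrative, not taken from the question's code:

from collections import defaultdict
import numpy as np

class StochasticModel:
    def __init__(self):
        # (state, action) -> {(next_state, reward): count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, state, action, state2, reward):
        # Record one observed outcome of taking `action` in `state`
        self.counts[(state, action)][(state2, reward)] += 1

    def sample(self):
        # Random previously observed (state, action) pair
        # (assumes at least one transition has been recorded)
        keys = list(self.counts.keys())
        return keys[np.random.randint(len(keys))]

    def step(self, state, action):
        # Sample an outcome with probability proportional to its observed count
        outcomes = self.counts[(state, action)]
        pairs = list(outcomes.keys())
        probs = np.array(list(outcomes.values()), dtype=float)
        probs /= probs.sum()
        idx = np.random.choice(len(pairs), p=probs)
        return pairs[idx]  # (next_state, reward)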

Check that, after a model step is taken, the planning steps sample from the next state, i.e. state2.

If not, planning might be taking repeated steps from the same starting state given by self.env.

However, I may have misunderstood the role of the self.env parameter in self.model.sample(self.env).
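
For reference, in the tabular Dyna-Q algorithm as described in Sutton & Barto (Chapter 8), each planning step samples a random previously observed (state, action) pair and replays it through the model rather than chaining from the previous planning step's successor. A minimal sketch of that planning loop, assuming Q is a 2-D NumPy array of action values and model is a dict of observed transitions (alpha, gamma and the other names are illustrative):

import numpy as np

def planning(Q, model, n_steps, alpha, gamma):
    # model: dict mapping (state, action) -> (next_state, reward)
    observed = list(model.keys())
    for _ in range(n_steps):
        # Random previously observed state-action pair
        s, a = observed[np.random.randint(len(observed))]
        s2, r = model[(s, a)]
        # Standard Q-learning backup on the simulated transition
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])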
