
Q-learning, what is the effect of test episodes count on convergence?

The following code solves FrozenLake 4x4 with Q-learning. In the training part, why do we play 20 episodes of the test environment in each loop instead of just 1? I tried both numbers of iterations:

When playing 20 iterations of the test environment, the agent converges in more than 16,000 tries.

When playing 1 iteration of the test environment, the agent converges in fewer than 1,000 tries.

import gym
import collections
from tensorboardX import SummaryWriter

ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
ALPHA = 0.2
TEST_EPISODES = 20


class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.values = collections.defaultdict(float)

    def sample_env(self):
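        """Take one random action, reset the environment if the episode ended, and return the (state, action, reward, next_state) transition."""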
        action = self.env.action_space.sample()
        old_state = self.state
        new_state, reward, is_done, _ = self.env.step(action)
        self.state = self.env.reset() if is_done else new_state
        return (old_state, action, reward, new_state)

    def best_value_and_action(self, state):
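        """Return the highest Q-value among all actions in `state`, together with the corresponding greedy action."""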
        best_value, best_action = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_value, best_action

    def value_update(self, s, a, r, next_s):
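        """Blend the old Q-value with the Bellman target: Q(s,a) <- (1-ALPHA)*Q(s,a) + ALPHA*(r + GAMMA*max_a' Q(s',a'))."""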
        best_v, _ = self.best_value_and_action(next_s)
        new_val = r + GAMMA * best_v
        old_val = self.values[(s, a)]
        self.values[(s, a)] = old_val * (1-ALPHA) + new_val * ALPHA

    def play_episode(self, env):
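        """Play one full episode on `env`, acting greedily w.r.t. the current Q-values, and return the total reward."""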
        total_reward = 0.0
        state = env.reset()
        while True:
            _, action = self.best_value_and_action(state)
            new_state, reward, is_done, _ = env.step(action)
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward


if __name__ == "__main__":
    test_env = gym.make(ENV_NAME)
    agent = Agent()
    writer = SummaryWriter(comment="-q-learning")

    iter_no = 0
    best_reward = 0.0
    while True:
        iter_no += 1
        s, a, r, next_s = agent.sample_env()
        agent.value_update(s, a, r, next_s)

        reward = 0.0
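        # Evaluate the greedy policy: average the reward over TEST_EPISODES test episodes.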
        for _ in range(TEST_EPISODES):
            reward += agent.play_episode(test_env)
        reward /= TEST_EPISODES
        writer.add_scalar("reward", reward, iter_no)
        if reward > best_reward:
            print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
            best_reward = reward
        if reward > 0.80:
            print("Solved in %d iterations!" % iter_no)
            break
    writer.close()

In this example, TEST_EPISODES is used to alter the solve criterion. With TEST_EPISODES = 1 the game is considered solved as soon as the most recent game reaches a score of > 0.80, and with TEST_EPISODES = 20 the average score over the last 20 rounds must be > 0.80 for the game to be considered solved.

Since this game has stochastic actions, i.e. you don't get the same result every time you take the same action in the same state, the higher you drive up TEST_EPISODES, the more robust the solution is likely to be. With TEST_EPISODES = 1 the script would consider the game solved if the agent just happened to stumble onto the goal on its first try, but doing that 20 times in a row with a poor model is far less likely.
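
To make that concrete, here is a small back-of-the-envelope sketch (not part of the original program; it assumes each test episode independently reaches the goal with some fixed probability p, and that an episode scores 1 on success and 0 otherwise, as in FrozenLake):

from math import comb

def prob_avg_above_threshold(p, n, threshold=0.8):
    """Probability that the mean of n independent Bernoulli(p) episode scores exceeds threshold."""
    min_successes = int(threshold * n) + 1  # need strictly more than threshold * n successes
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(min_successes, n + 1))

# A weak policy that only reaches the goal 30% of the time:
print(prob_avg_above_threshold(0.3, n=1))   # 0.3    -> TEST_EPISODES = 1 declares "solved" 30% of the time
print(prob_avg_above_threshold(0.3, n=20))  # ~5e-07 -> TEST_EPISODES = 20 almost never does

The stricter 20-episode average mostly filters out lucky runs, which is why the script needs far more training iterations before it reports "Solved": a single lucky episode can trip the 1-episode check long before the policy is actually reliable.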

The average value over a larger number of episodes is often a better metric for these sorts of problems than the speed of reaching the goal for the first time. Imagine you had to operate in this environment and your life depended on reaching the goal safely; you would probably want the agent to keep learning until that score threshold was much closer to 1.
