
Q-learning: what is the effect of the test-episode count on convergence?

The following code solves FrozenLake 4x4 with Q-learning. In the training loop, why do we play 20 episodes of the test environment instead of just 1 on each iteration? I tried both values:

When playing 20 episodes of the test environment per iteration, the agent converges in more than 16,000 iterations.

When playing 1 episode of the test environment per iteration, the agent converges in fewer than 1,000 iterations.

import gym
import collections
from tensorboardX import SummaryWriter

ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9          # discount factor
ALPHA = 0.2          # learning rate for the Q-value update
TEST_EPISODES = 20   # test episodes played per training iteration


class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.values = collections.defaultdict(float)

    def sample_env(self):
        # Take one random action in the training environment and return
        # the observed transition (old_state, action, reward, new_state).
        action = self.env.action_space.sample()
        old_state = self.state
        new_state, reward, is_done, _ = self.env.step(action)
        self.state = self.env.reset() if is_done else new_state
        return (old_state, action, reward, new_state)

    def best_value_and_action(self, state):
        # Return the largest Q-value for this state and its greedy action.
        best_value, best_action = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_value, best_action

    def value_update(self, s, a, r, next_s):
        # Tabular Q-learning update with learning rate ALPHA:
        # Q(s, a) <- (1 - ALPHA) * Q(s, a) + ALPHA * (r + GAMMA * max_a' Q(s', a'))
        best_v, _ = self.best_value_and_action(next_s)
        new_val = r + GAMMA * best_v
        old_val = self.values[(s, a)]
        self.values[(s, a)] = old_val * (1 - ALPHA) + new_val * ALPHA

    def play_episode(self, env):
        # Play one full episode greedily w.r.t. the current Q-values and
        # return the undiscounted total reward (1.0 on success, else 0.0).
        total_reward = 0.0
        state = env.reset()
        while True:
            _, action = self.best_value_and_action(state)
            new_state, reward, is_done, _ = env.step(action)
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward


if __name__ == "__main__":
    test_env = gym.make(ENV_NAME)
    agent = Agent()
    writer = SummaryWriter(comment="-q-learning")

    iter_no = 0
    best_reward = 0.0
    while True:
        iter_no += 1
        # One training step: sample a random transition, apply the Q-update.
        s, a, r, next_s = agent.sample_env()
        agent.value_update(s, a, r, next_s)

        # Evaluate the greedy policy: average the reward over
        # TEST_EPISODES freshly played test episodes.
        reward = 0.0
        for _ in range(TEST_EPISODES):
            reward += agent.play_episode(test_env)
        reward /= TEST_EPISODES
        writer.add_scalar("reward", reward, iter_no)
        if reward > best_reward:
            print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
            best_reward = reward
        if reward > 0.80:
            print("Solved in %d iterations!" % iter_no)
            break
    writer.close()

In this example, TEST_EPISODES changes the criterion for considering the game solved. With TEST_EPISODES = 1 the game counts as solved as soon as a single test episode scores > 0.80, while with TEST_EPISODES = 20 the average score over 20 freshly played test episodes must exceed 0.80.
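
As a minimal sketch, the evaluation step can be factored into a helper that makes the role of the episode count explicit (the name evaluate is mine, not from the original script):

def evaluate(agent, env, n_episodes):
    # Sketch: equivalent to the inner evaluation loop of the script above.
    # Play n_episodes greedy episodes and return the mean total reward.
    # A larger n_episodes gives a lower-variance estimate of the true
    # success rate; n_episodes=1 is a single 0-or-1 sample.
    total = 0.0
    for _ in range(n_episodes):
        total += agent.play_episode(env)
    return total / n_episodes

The main loop would then read reward = evaluate(agent, test_env, TEST_EPISODES).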

Since this game has stochastic transitions, i.e. taking the same action in the same state does not always give the same result, the higher you drive up TEST_EPISODES, the more robust the solution is likely to be. With TEST_EPISODES = 1 this script would declare the game solved if the agent just happened to stumble onto the goal once, but a poor model is far less likely to succeed in at least 17 of 20 episodes, which is what an average above 0.80 requires.
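
To put a number on that (a back-of-the-envelope check, under the assumption that each test episode is an independent success/failure trial with probability p), the chance that a weak policy clears the threshold is a binomial tail:

from math import comb

def pass_probability(p, n=20, threshold=0.80):
    # P(mean reward > threshold) over n independent test episodes,
    # each succeeding (reward 1.0) with probability p.
    k_min = int(threshold * n) + 1  # > 0.80 of 20 means 17+ successes
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

print(pass_probability(0.5))       # ~0.0013 with 20 test episodes
print(pass_probability(0.5, n=1))  # 0.5 with a single test episode

So a mediocre policy that reaches the goal half the time passes the 1-episode test half the time, but passes the 20-episode test only about once in 800 attempts.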

The average over a larger number of test episodes is usually a better metric for problems like this than the speed of reaching the goal for the first time. Imagine you had to operate in this environment and your life depended on reaching the goal safely: you would probably want the agent to keep learning until that score threshold was much closer to 1.
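
The same point in terms of estimation error (again assuming independent success/failure test episodes): the standard error of the measured success rate shrinks as 1/sqrt(N), so the 20-episode average is a much more trustworthy reading of the policy's true quality:

from math import sqrt

def standard_error(p, n):
    # Standard deviation of the average of n Bernoulli(p) test episodes.
    return sqrt(p * (1 - p) / n)

print(standard_error(0.8, 1))    # 0.4   -- one episode says almost nothing
print(standard_error(0.8, 20))   # ~0.089 -- the 20-episode mean is far tighter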
