
Keras Q-learning model performance doesn't improve when playing CartPole

I'm trying to train a deep Q-learning Keras model to play CartPole-v1, but it doesn't seem to get any better. I don't believe it's a bug; rather, it's my lack of knowledge of how to use Keras and OpenAI Gym properly. I am following this tutorial (https://adventuresinmachinelearning.com/reinforcement-learning-tutorial-python-keras/), which shows how to train a bot to play NChain-v0 (which I was able to follow), and now I am trying to apply what I learned to a more complex environment: CartPole-v1. Here is the code:

###import libraries
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


###prepare environment
env = gym.make('CartPole-v1') #our environment is CartPole-v1


###make model
model = Sequential()
model.add(Dense(128, input_shape=(env.observation_space.shape[0],), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(env.action_space.n, activation='linear'))
model.compile(loss='mse', optimizer=Adam(), metrics=['mae'])


###train model
def train_model(n_episodes=500, epsilon=0.5, decay_factor=0.999, gamma=0.95):
    G_array = []
    for episode in range(n_episodes):
        observation = env.reset()
        observation = observation.reshape(-1, env.observation_space.shape[0])
        epsilon *= decay_factor
        G = 0
        done = False
        while done != True:
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(model.predict(observation))
            new_observation, reward, done, info = env.step(action) #It keeps going left! Why though?
            new_observation = new_observation.reshape(-1, env.observation_space.shape[0])
            target = reward + gamma*np.max(model.predict(new_observation))
            target_vector = model.predict(observation)[0]
            target_vector[action] = target
            model.fit(observation, target_vector.reshape(-1, env.action_space.n), epochs=1, verbose=0)
            observation = new_observation
            G += reward
        G_array.append(G)

    return G_array

G_array = train_model()
print(G_array)

The output for the 'G_array' (the total reward for each game) is the following:

[14.0, 16.0, 18.0, 12.0, 16.0, 14.0, 17.0, 11.0, 11.0, 12.0, 11.0, 15.0, 13.0, 12.0, 12.0, 19.0, 13.0, 9.0, 10.0, 10.0, 11.0, 11.0, 14.0, 11.0, 10.0, 9.0, 10.0, 10.0, 12.0, 9.0, 15.0, 19.0, 11.0, 11.0, 10.0, 11.0, 13.0, 12.0, 13.0, 16.0, 12.0, 14.0, 9.0, 12.0, 20.0, 10.0, 12.0, 11.0, 9.0, 13.0, 13.0, 11.0, 13.0, 11.0, 24.0, 12.0, 11.0, 9.0, 9.0, 11.0, 10.0, 16.0, 10.0, 9.0, 9.0, 19.0, 10.0, 11.0, 13.0, 11.0, 11.0, 14.0, 23.0, 8.0, 13.0, 12.0, 15.0, 14.0, 11.0, 24.0, 9.0, 11.0, 11.0, 11.0, 10.0, 12.0, 11.0, 11.0, 10.0, 13.0, 18.0, 10.0, 17.0, 11.0, 13.0, 14.0, 12.0, 16.0, 13.0, 10.0, 10.0, 12.0, 22.0, 13.0, 11.0, 14.0, 10.0, 11.0, 11.0, 14.0, 14.0, 12.0, 18.0, 17.0, 9.0, 13.0, 12.0, 11.0, 11.0, 9.0, 16.0, 9.0, 18.0, 15.0, 12.0, 16.0, 13.0, 10.0, 13.0, 13.0, 17.0, 11.0, 11.0, 9.0, 9.0, 12.0, 9.0, 10.0, 9.0, 10.0, 18.0, 9.0, 11.0, 12.0, 10.0, 10.0, 10.0, 12.0, 12.0, 20.0, 13.0, 19.0, 9.0, 14.0, 14.0, 13.0, 19.0, 10.0, 18.0, 11.0, 11.0, 11.0, 8.0, 10.0, 14.0, 11.0, 16.0, 11.0, 13.0, 13.0, 9.0, 16.0, 11.0, 12.0, 13.0, 12.0, 11.0, 10.0, 11.0, 21.0, 12.0, 22.0, 12.0, 10.0, 13.0, 15.0, 19.0, 11.0, 10.0, 10.0, 11.0, 22.0, 11.0, 9.0, 26.0, 13.0, 11.0, 13.0, 13.0, 10.0, 10.0, 11.0, 12.0, 18.0, 9.0, 11.0, 13.0, 12.0, 13.0, 13.0, 12.0, 10.0, 11.0, 12.0, 12.0, 17.0, 11.0, 13.0, 13.0, 21.0, 12.0, 9.0, 14.0, 10.0, 15.0, 12.0, 12.0, 14.0, 11.0, 10.0, 14.0, 12.0, 12.0, 11.0, 8.0, 24.0, 9.0, 13.0, 10.0, 14.0, 10.0, 12.0, 13.0, 12.0, 13.0, 13.0, 14.0, 9.0, 17.0, 16.0, 9.0, 16.0, 14.0, 11.0, 9.0, 10.0, 15.0, 11.0, 9.0, 14.0, 12.0, 10.0, 13.0, 10.0, 10.0, 16.0, 15.0, 11.0, 8.0, 9.0, 9.0, 10.0, 9.0, 21.0, 13.0, 13.0, 10.0, 10.0, 11.0, 27.0, 13.0, 15.0, 11.0, 11.0, 12.0, 9.0, 10.0, 16.0, 10.0, 13.0, 13.0, 12.0, 12.0, 11.0, 17.0, 14.0, 9.0, 15.0, 26.0, 9.0, 9.0, 13.0, 9.0, 8.0, 12.0, 9.0, 10.0, 11.0, 9.0, 10.0, 9.0, 11.0, 9.0, 10.0, 12.0, 13.0, 13.0, 11.0, 11.0, 10.0, 15.0, 11.0, 11.0, 13.0, 10.0, 10.0, 12.0, 10.0, 10.0, 12.0, 9.0, 15.0, 29.0, 11.0, 9.0, 18.0, 11.0, 13.0, 13.0, 16.0, 13.0, 15.0, 10.0, 11.0, 18.0, 9.0, 9.0, 11.0, 15.0, 11.0, 11.0, 10.0, 25.0, 10.0, 9.0, 11.0, 15.0, 15.0, 11.0, 11.0, 11.0, 13.0, 9.0, 11.0, 9.0, 13.0, 12.0, 12.0, 14.0, 11.0, 14.0, 8.0, 10.0, 13.0, 10.0, 10.0, 10.0, 9.0, 13.0, 9.0, 12.0, 10.0, 11.0, 9.0, 11.0, 12.0, 20.0, 9.0, 10.0, 14.0, 9.0, 12.0, 13.0, 11.0, 11.0, 11.0, 10.0, 15.0, 14.0, 14.0, 12.0, 13.0, 12.0, 11.0, 10.0, 12.0, 12.0, 9.0, 11.0, 9.0, 11.0, 13.0, 10.0, 11.0, 11.0, 11.0, 12.0, 13.0, 13.0, 12.0, 8.0, 11.0, 13.0, 9.0, 12.0, 10.0, 10.0, 15.0, 12.0, 11.0, 10.0, 17.0, 10.0, 14.0, 9.0, 10.0, 10.0, 10.0, 12.0, 10.0, 10.0, 12.0, 10.0, 15.0, 10.0, 10.0, 9.0, 10.0, 10.0, 10.0, 19.0, 9.0, 10.0, 11.0, 10.0, 11.0, 11.0, 13.0, 10.0, 11.0, 12.0, 11.0, 12.0, 13.0, 11.0, 8.0, 12.0, 12.0, 14.0, 14.0, 11.0, 9.0, 11.0, 9.0, 12.0, 9.0, 8.0, 9.0, 12.0, 8.0, 10.0, 11.0, 13.0, 12.0, 12.0, 10.0, 11.0, 12.0, 10.0, 12.0, 13.0, 9.0, 9.0, 10.0, 15.0, 14.0, 16.0, 8.0, 19.0, 10.0]

This apparently means the model did not improve at all over the 500 episodes. Excuse me, as I am a complete beginner with Keras and OpenAI Gym (especially Keras). Any help is appreciated. Thank you.

UPDATE: Through some debugging, I've recently noticed that the model tends to go left, i.e. choose action 0, most of the time. Does that mean I should add some if-statements to modify the reward system (e.g. increase the reward if the pole angle is less than 5 degrees)? In fact, I am trying that right now, but to no avail so far.
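For reference, a minimal sketch of the reward-shaping idea described above (a hypothetical helper, not code from the original post; it assumes the raw 1-D observation from env.step(), where index 2 is the pole angle in radians):

import numpy as np

def shaped_reward(raw_observation, reward, done):
    # Hypothetical reward shaping: bonus for keeping the pole near vertical,
    # penalty when the episode terminates early.
    angle = raw_observation[2]              # pole angle in radians
    if abs(angle) < np.radians(5):          # within 5 degrees of vertical
        reward += 0.5
    if done:
        reward = -10.0
    return reward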

Reinforcement learning is very noisy, and your batch size is 1, which makes it even noisier. You can try using a memory buffer of past transitions that you keep updating; something like deque() from collections works well for this buffer. You then randomly sample from this memory buffer according to a given batch size. I found this repo very helpful (it includes a replay/memory buffer and an RL agent, as you need them): https://github.com/udacity/deep-reinforcement-learning/tree/master/dqn Nevertheless, RL takes a long time to converge. Unlike conventional deep learning, where the loss decreases very fast in the beginning, in RL the reward will often not increase for a long time and then suddenly start increasing.
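A minimal sketch of such a replay buffer, assuming the (1, 4)-shaped observations produced by the reshape calls in the question (the names memory, remember and replay are illustrative, not from any particular library):

import random
from collections import deque
import numpy as np

memory = deque(maxlen=2000)   # stores (state, action, reward, next_state, done) tuples
batch_size = 32

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def replay(model, gamma=0.95):
    # Skip training until enough experience has been collected.
    if len(memory) < batch_size:
        return
    minibatch = random.sample(memory, batch_size)
    states = np.vstack([m[0] for m in minibatch])       # shape (batch_size, 4)
    next_states = np.vstack([m[3] for m in minibatch])
    targets = model.predict(states)                     # current Q-value estimates
    next_q = model.predict(next_states)
    for i, (_, action, reward, _, done) in enumerate(minibatch):
        # Don't bootstrap from the next state on terminal transitions.
        targets[i][action] = reward if done else reward + gamma * np.max(next_q[i])
    model.fit(states, targets, epochs=1, verbose=0)

In the training loop you would call remember(...) after each env.step(...) and then replay(model) once per step (or once per episode) instead of fitting on a single transition, which gives the batched, less noisy updates described above.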
