
Something wrong with Keras code Q-learning OpenAI gym FrozenLake

Maybe my question will seem stupid.

I'm studying the Q-learning algorithm. In order to understand it better, I'm trying to rewrite the TensorFlow code of this FrozenLake example in Keras.

My code:

import gym
import numpy as np
import random

from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K    

import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))

def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))

model.compile(loss=custom_loss, optimizer='sgd')

# Set learning parameters
y = .99   # discount factor (gamma)
e = 0.1   # epsilon for epsilon-greedy exploration
#create lists to contain total rewards and steps per episode
jList = []
rList = []

num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j+=1

        # One-hot encode the current state and predict Q-values for all 4 actions
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))

        if np.random.rand(1) < e:
            action[0] = env.action_space.sample() #random action

        new_state, reward, d, _ = env.step(action[0])

        rAll += reward
        jList.append(j)
        rList.append(rAll)

        # Q-values of the next state; their max is used in the Bellman target
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)

        # Update only the taken action's Q-value towards reward + gamma * max Q(s')
        targetQ = current_state_Q_values
        targetQ[0,action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state

        if d == True:
            #Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")

When I run it, it doesn't work well: Percent of successful episodes: 0.052%

plt.plot(rList)


The original TensorFlow code does much better: Percent of successful episodes: 0.352%

plt.plot(rList)


What have I done wrong?

Besides setting use_bias=False as @Maldus mentioned in the comments, another thing you can try is to start with a higher epsilon value (e.g. 0.5 or 0.75). A trick might be to only decrease the epsilon value if you reach the goal, i.e. don't decrease epsilon at the end of every episode. That way your agent can keep exploring the map randomly until it starts to converge on a good route, and then it makes sense to reduce the epsilon parameter.
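
For example, here's a rough sketch of that idea applied to the end of the question's inner loop (same variable names as the question; the decay schedule itself is just illustrative):

        if d == True:
            # Only decay epsilon when the episode actually reached the goal
            # (in FrozenLake-v0 the reward is 1 only on the goal tile),
            # otherwise keep exploring at the current rate
            if reward > 0:
                e = 1./((i/50) + 10)
            break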

I've actually implemented a similar model in Keras in this gist, using Convolutional layers instead of Dense layers. Managed to get it to work in under 2000 episodes. Might be of some help to others :)
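
The gist itself isn't reproduced here, but as a rough, hedged sketch of that direction (my own illustration, not the gist's actual code): the one-hot state can be reshaped into a 4x4x1 grid and fed through a small convolutional Q-network with a linear output layer, e.g.:

import numpy as np
from keras.layers import Conv2D, Flatten, Dense
from keras.models import Sequential

# Hypothetical sketch: treat the 4x4 FrozenLake grid as a tiny "image"
conv_model = Sequential()
conv_model.add(Conv2D(16, (2, 2), activation='relu', input_shape=(4, 4, 1)))
conv_model.add(Flatten())
conv_model.add(Dense(4, activation='linear'))  # linear output, one Q-value per action
conv_model.compile(loss='mse', optimizer='adam')

# A state index s would be one-hot encoded and reshaped before prediction:
# state_grid = np.identity(16)[s].reshape(1, 4, 4, 1)
# q_values = conv_model.predict(state_grid)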
