
Double DQN Agent can't complete environment outside of training
I have made my first custom OpenAI environment and have had to build a Double DQN agent to learn it. At the moment the agent gets through training without problems and completes episodes; see the training output below:
387/1000: episode: 1, duration: 33.743s, episode steps: 387, steps per second: 11, episode reward: 0.708, mean reward: 0.002 [ 0.000, 0.708], mean action: 0.478 [0.000, 1.000], loss: 17436.449507, mae: 257.172026, mean_q: 460.083782, mean_eps: 0.803350
701/1000: episode: 2, duration: 26.474s, episode steps: 314, steps per second: 12, episode reward: 0.908, mean reward: 0.003 [ 0.000, 0.908], mean action: 0.535 [0.000, 1.000], loss: 9028.783300, mae: 292.782444, mean_q: 539.317775, mean_eps: 0.510850
The problem arises when I try to test the agent with the saved weights, using the following code:
dqn.test(env,nb_episodes=1,visualize=False)
For the environment to count as complete, the agent needs to fill out 20 lineups, each of which can be completed with 2 discrete actions. At the moment, during testing the agent gets through 2 lineups and then falls into an infinite loop and stops working. This does not happen during the (very noisy) training, and I believe it is caused by some mistake in my TensorFlow/Keras code below.
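For reference, one way to keep the test run from hanging forever while this is being debugged is to cap the episode length. This is only a minimal sketch assuming keras-rl's standard Agent.test signature; the cap of 100 steps is an arbitrary placeholder, not a value from the original run:

# Cap the test episode so a stuck agent terminates instead of looping forever
# (nb_max_episode_steps is part of keras-rl's Agent.test signature; 100 is arbitrary).
dqn.test(env, nb_episodes=1, visualize=False, nb_max_episode_steps=100)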
I would really appreciate it if someone could help me with this. Please see the full code for training/testing the DQNAgent below.
import gym
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
file_path = 'C:/Users/MichaelArena/Desktop/OpenAi/DFS Captain/OpenAi_Upload/Captain_Mode_DFS/Captain_Mode_DFS/envs/Simulation_Showdown_Rams_vs_Sea.csv'
df = pd.read_csv(file_path)
#setup the environment
env = gym.make("Captain_Mode_DFS:DFS-Captain-v0", df=df)
#reset the environment
env.reset()
for step in range(20):
    env.render(mode="human")
    action = env.observation_space.sample()
    obs, reward, done, info = env.step(action)
nb_actions = env.observation_space.n
nb_obs = env.obs.shape
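# Q-network: flattens the observation and maps it to one Q-value per action.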
model = Sequential()
model.add(Flatten(input_shape=(1,)+ nb_obs))
model.add(Dense(33))
model.add(Activation('relu'))
model.add(Dense(66))
model.add(Activation('relu'))
model.add(Dense(132))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
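# Replay memory and a linearly annealed epsilon-greedy exploration policy.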
memory = SequentialMemory(limit=50, window_length=1)
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=1.0,
                              value_min=0.10,
                              value_test=0.05,
                              nb_steps=1000)
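# Assemble and compile the agent; keras-rl's DQNAgent enables double DQN by default (enable_double_dqn=True).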
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=50,
               target_model_update=20, policy=policy, batch_size=6, gamma=0.99)
dqn.compile(Adam(learning_rate=0.01),metrics=['mae'])
dqn.fit(env,nb_steps=1000,visualize=False,verbose=2)
dqn.save_weights(f'cpt_mode_20.h5f',overwrite=True)
dqn.test(env,nb_episodes=1,visualize=False)
env.close()
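For completeness, if the test is run in a fresh session the saved weights need to be reloaded before calling test, and keras-rl's DQNAgent also accepts an explicit test_policy (it falls back to a greedy policy at test time otherwise). The snippet below is only a sketch under those assumptions; the eps value is a placeholder and the file name simply mirrors the one passed to save_weights above:

# Rebuild the agent with an explicit test-time policy
# (DQNAgent uses GreedyQPolicy at test time when test_policy is not given).
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=50,
               target_model_update=20, policy=policy,
               test_policy=EpsGreedyQPolicy(eps=0.05),  # keep a little exploration at test time
               batch_size=6, gamma=0.99)
dqn.compile(Adam(learning_rate=0.01), metrics=['mae'])
# Restore the weights written by save_weights before testing in a new session.
dqn.load_weights('cpt_mode_20.h5f')
dqn.test(env, nb_episodes=1, visualize=False)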