What's wrong with Dyna-Q? (Dyna-Q vs Q-learning)
I implemented the Q-learning algorithm and used it on FrozenLake-v0 in OpenAI Gym. I am getting about 185 total reward during training and about 7333 total reward during testing over 10000 episodes. Is this good?

I also tried the Dyna-Q algorithm, but it gives worse performance than Q-learning: approximately 200 total reward during training and 700-900 total reward during testing over 10000 episodes with 50 planning steps.

Why is this happening? Below is the code. Is something wrong with it?
# Setup
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
epsilon = 0.9
lr_rate = 0.1
gamma = 0.99
planning_steps = 0   # set to 50 for the Dyna-Q runs
total_episodes = 10000
max_steps = 100
Training and testing:
while t < max_steps:
    action = agent.choose_action(state)
    state2, reward, done, info = agent.env.step(action)

    # Removed in testing
    agent.learn(state, state2, reward, action)       # direct RL update
    agent.model.add(state, action, state2, reward)   # model learning
    agent.planning(planning_steps)                   # planning updates
    # Till here

    state = state2
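During testing I run the same loop with the marked lines removed, i.e. roughly:

while t < max_steps:
    action = agent.choose_action(state)
    state2, reward, done, info = agent.env.step(action)
    state = state2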
def add(self, state, action, state2, reward):
    self.transitions[state, action] = state2
    self.rewards[state, action] = reward

def sample(self, env):
    state, action = 0, 0
    # Random visited state
    if all(np.sum(self.transitions, axis=1)) <= 0:
        state = np.random.randint(env.observation_space.n)
    else:
        state = np.random.choice(np.where(np.sum(self.transitions, axis=1) > 0)[0])
    # Random action in that state
    if all(self.transitions[state]) <= 0:
        action = np.random.randint(env.action_space.n)
    else:
        action = np.random.choice(np.where(self.transitions[state] > 0)[0])
    return state, action

def step(self, state, action):
    state2 = self.transitions[state, action]
    reward = self.rewards[state, action]
    return state2, reward
def choose_action(self, state):
    if np.random.uniform(0, 1) < epsilon:
        return self.env.action_space.sample()
    else:
        return np.argmax(self.Q[state, :])

def learn(self, state, state2, reward, action):
    # predict = Q[state, action]
    # Q[state, action] = Q[state, action] + lr_rate * (target - predict)
    target = reward + gamma * np.max(self.Q[state2, :])
    self.Q[state, action] = (1 - lr_rate) * self.Q[state, action] + lr_rate * target

def planning(self, n_steps):
    # if len(self.transitions) > planning_steps:
    for i in range(n_steps):
        state, action = self.model.sample(self.env)
        state2, reward = self.model.step(state, action)
        self.learn(state, state2, reward, action)
I guess it could be because the environment is stochastic. Learning a model in a stochastic environment may lead to sub-optimal policies. In Sutton & Barto's RL book they say that they assume a deterministic environment.
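One way to cope with the stochasticity would be to let the model store transition counts and sample next states from them during planning, instead of remembering a single successor per (state, action). Below is a minimal sketch of such a model, mirroring the add/step interface of the question's model; the class name, array names and everything not shown in the question are my own assumptions:

import numpy as np

class StochasticModel:
    def __init__(self, n_states, n_actions):
        # counts[s, a, s2]: how often taking a in s led to s2
        self.counts = np.zeros((n_states, n_actions, n_states))
        # running reward sums per (s, a, s2), used to estimate the mean reward
        self.reward_sums = np.zeros((n_states, n_actions, n_states))

    def add(self, state, action, state2, reward):
        self.counts[state, action, state2] += 1
        self.reward_sums[state, action, state2] += reward

    def step(self, state, action):
        # assumes (state, action) has been visited at least once
        c = self.counts[state, action]
        probs = c / c.sum()
        state2 = np.random.choice(len(c), p=probs)   # sample in proportion to observed frequency
        reward = self.reward_sums[state, action, state2] / c[state2]
        return state2, reward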
Check that after a model step is taken, the planning steps sample from the next state, i.e. state2. If not, planning might be taking repeated steps from the same starting state given by self.env. However, I may have misunderstood the role of the self.env parameter in self.model.sample(self.env).
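For comparison, tabular Dyna-Q as described in Sutton & Barto samples a previously observed state-action pair uniformly at random on every planning step and then performs an ordinary Q-learning backup on the modeled transition. A rough sketch, assuming a dictionary model keyed by (state, action); the names here are illustrative and not taken from the question's code:

import random

def planning(Q, model, n_steps, lr_rate, gamma):
    # Q: numpy array of shape (n_states, n_actions)
    # model: dict mapping (state, action) -> (reward, next_state),
    #        filled only with pairs that were actually experienced
    for _ in range(n_steps):
        state, action = random.choice(list(model.keys()))
        reward, state2 = model[(state, action)]
        target = reward + gamma * Q[state2].max()
        Q[state, action] += lr_rate * (target - Q[state, action])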