
DQN not working properly

I am trying to write my own DQN in Python, using Keras. I think my logic is correct. I am trying it on the CartPole environment, but the reward does not increase even after 50,000 episodes. Any help would be appreciated. For now I am not looking at the dueling or Double DQN variants.

    import random
    from collections import deque

    import gym
    import numpy as np

    class ReplayBuffer:
        def __init__(self, size=100000):
            self.buffer=deque(maxlen=size)

        def sample(self, sample_size):
            return random.sample(self.buffer, sample_size)

        def add_to_buffer(self, experience):
            self.buffer.append(experience)

    def generator(number):
        return(i for i in range(number))

    def epsilon_greedy_policy(q_values, epsilon):
        number_of_actions =len(q_values)
        action_probabilites = np.ones(number_of_actions, dtype=float)*epsilon/number_of_actions
        best_action = np.argmax(q_values)
        action_probabilites[best_action]+= (1-epsilon)
        return np.random.choice(number_of_actions, p=action_probabilites)

    class DQNAgent:
        def __init__(self, env, model, gamma):
            self.env=env
            self.model=model
            self.replay_buffer=ReplayBuffer()
            self.gamma=gamma
            self.state_dim=env.observation_space.shape[0]

        def train_model(self, training_data, training_label):
            self.model.fit(training_data, training_label, batch_size=32, verbose=0)

        def predict_one(self, state):
            return self.model.predict(state.reshape(1, self.state_dim)).flatten()

        def experience_replay(self, experiences):
            states, actions, rewards, next_states = zip(*experiences)
            states=np.asarray(states)
            place_holder_state=np.zeros(self.state_dim)
            next_states_ = np.asarray([(place_holder_state if next_state is None else next_state) for next_state in next_states])
            q_values_for_states=self.model.predict(states)
            q_values_for_next_states=self.model.predict(next_states_)
            for x in generator(len(experiences)):
                y_true=rewards[x]
                if next_states[x].any():
                    y_true +=self.gamma*(np.amax(q_values_for_next_states[x]))
                q_values_for_states[x][actions[x]]=y_true
            self.train_model(states, q_values_for_states)

        def fit(self, number_of_epsiodes, batch_size):
            for _ in generator(number_of_epsiodes):
                total_reward=0
                state=env.reset()
                while True:
                    #self.env.render()
                    q_values_for_state=self.predict_one(state)
                    action=epsilon_greedy_policy(q_values_for_state, 0.1)
                    next_state, reward, done, _=env.step(action)
                    self.replay_buffer.add_to_buffer([state, action, reward, next_state])
                    state = next_state
                    total_reward += reward
                    if len(self.replay_buffer.buffer) > 50:
                        experience=self.replay_buffer.sample(batch_size)
                        self.experience_replay(experience)
                    if done:
                       break
                print("Total reward:", total_reward)


    env = gym.make('CartPole-v0')
    model=create_model(env.observation_space.shape[0], env.action_space.n)
    agent=DQNAgent(env, model, 0.99)
    agent.fit(100000, 32)

The error lies in these two lines

    q_values_for_states=self.model.predict(states)
    q_values_for_next_states=self.model.predict(next_states_)

You have the same network for Q and for its target. In the DQN paper, the authors use two separate networks and update the target network every X steps by copying the Q-network weights.
The correct equations are (pseudocode)

    T = R + gamma * max(QT(next_state))  # target
    E = T - Q(state)                     # error

So your equations should be

    q_values_for_states=self.model.predict(states)
    q_values_for_next_states=self.target_model.predict(next_states_)
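
For example, the target network could be built and hard-updated like this (a minimal sketch of my own, assuming the Keras API bundled with TensorFlow; standalone Keras exposes the same `clone_model`/`get_weights`/`set_weights` calls, and the helper names are placeholders):

    # Sketch: a second network with the same architecture as the Q-network,
    # whose weights are copied from the Q-network only every X training steps.
    from tensorflow.keras.models import clone_model

    def make_target_model(model):
        target_model = clone_model(model)              # same architecture, fresh weights
        target_model.set_weights(model.get_weights())  # start in sync with the Q-network
        return target_model

    def hard_update(target_model, model):
        # call this every X training steps (e.g. every few hundred updates)
        target_model.set_weights(model.get_weights())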

And then you periodically update target_model. In more recent papers (for example the DDPG one), instead of copying the weights every X steps, they perform a soft update at every step, that is

    QT_weights = tau*Q_weights + (1-tau)*QT_weights
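
With Keras models, that soft update could be sketched as follows (my own sketch; tau is a small constant such as 0.001):

    def soft_update(target_model, model, tau=0.001):
        # blend each weight array of the target network toward the Q-network
        q_weights = model.get_weights()              # list of numpy arrays
        target_weights = target_model.get_weights()
        blended = [tau * qw + (1.0 - tau) * tw
                   for qw, tw in zip(q_weights, target_weights)]
        target_model.set_weights(blended)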

What you are doing, instead, is like updating the target network every step. This makes the algorithm very unstable, as the authors of DQN state in their paper.

Also, I would increase the minimum number of samples collected before learning starts. You begin learning when only 50 samples have been collected, which is far too few. In the paper they use far more, and for CartPole I would wait until about 1000 samples have been collected (consider that you want to be able to balance the pole for at least 1000 steps).
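
Concretely, the warm-up check in fit would become something like this (1000 is just a reasonable value I would pick for CartPole, not a number from the paper):

    if len(self.replay_buffer.buffer) > 1000:   # was 50
        experience = self.replay_buffer.sample(batch_size)
        self.experience_replay(experience)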

In the fit function I had to add

    if done:
        next_state = None
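
That is, right after env.step and before the experience is stored (a sketch of the relevant part of the loop):

    next_state, reward, done, _ = env.step(action)
    if done:
        next_state = None   # mark terminal transitions: no bootstrap term for them
    self.replay_buffer.add_to_buffer([state, action, reward, next_state])

Note that experience_replay then has to handle the stored None explicitly, for example by checking next_states[x] is not None rather than calling .any() on it.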
