Why does the score (accumulated reward) go down during the exploitation phase in this Deep Q-Learning model?

I'm having a hard time getting a Deep Q-Learning agent to find the optimal policy. This is what my current model looks like in TensorFlow:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()

model.add(Dense(units=32, activation="relu", input_dim=self.env.state.size))
model.add(Dense(units=self.env.allActionsKeys.size, activation="softmax"))

model.compile(loss="mse", optimizer=Adam(lr=0.00075), metrics=['accuracy'])

For the problem I'm working on at the moment, 'self.env.state.size' is equal to 6, and the number of possible actions ('self.env.allActionsKeys.size') is 30.

The input vector consists of integer features, each with a different range (the ranges are not too different in this particular problem, though). Two features have the range [0,3], another two [0,2], and the remaining two [0,1]. Please note that this is supposed to be a simple problem; I'm also aiming for more complicated ones where the input size would be 15, for instance, and the ranges can differ a bit more than that ([0,15], [0,3], ...).
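For example, a raw (un-normalized) state under this representation might look like the following (hypothetical values, just to illustrate the mixed ranges):

import numpy as np

# Two features in [0,3], two in [0,2], two in [0,1] (the ordering is an assumption)
current_state = np.array([2, 3, 1, 0, 1, 0])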

This is what my train method looks like:

def train(self, terminal_state):
    if len(self.replay_memory) < MIN_REPLAY_MEMORY_SIZE:
        return

    # Get MINIBATCH_SIZE random samples from replay_memory
    minibatch = random.sample(self.replay_memory, MINIBATCH_SIZE)

    # Transition: (current_state, action, reward, normalized_next_state, next_state, done)

    current_states = np.array([transition[0] for transition in minibatch])
    current_qs_minibatch = self.model.predict(current_states, batch_size=MINIBATCH_SIZE, use_multiprocessing=True)

    next_states = np.array([transition[3] for transition in minibatch])
    next_qs_minibatch = self.model.predict(next_states, batch_size=MINIBATCH_SIZE, use_multiprocessing=True)

    env_get_legal_actions = self.env.get_legal_actions
    np_max = np.max

    X = []
    y = []

    for index, (current_state, action, reward, normalized_next_state, next_state, done) in enumerate(minibatch):
        if not done:
            legalActionsIds = env_get_legal_actions(next_state)
            max_next_q = np_max(next_qs_minibatch[index][legalActionsIds])

            # Bellman target: immediate reward plus the discounted best
            # Q-value among the legal actions in the next state
            new_q = reward + DISCOUNT * max_next_q
        else:
            new_q = reward

        current_qs = current_qs_minibatch[index].copy()
        current_qs[action] = new_q

        X.append(current_state)
        y.append(current_qs)

    self.model.fit(np.array(X), np.array(y), batch_size=MINIBATCH_SIZE, verbose=0, shuffle=False)

where DISCOUNT = 0.99 and MINIBATCH_SIZE = 64.

I read that it's recommended to normalize the input vector, so I tested two different attribute normalization methods: min-max normalization and z-score normalization. And since the value ranges don't differ that much, I also tested without normalization. None of these methods proved to be better than the others.
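For reference, this is roughly what I mean by the two methods (a minimal sketch; the assignment of ranges to feature positions is an assumption, and one of these is what env_normalize in the algorithm below applies):

import numpy as np

# Per-feature bounds from the problem description: two features in [0,3],
# two in [0,2] and two in [0,1]
STATE_MIN = np.zeros(6, dtype=np.float32)
STATE_MAX = np.array([3, 3, 2, 2, 1, 1], dtype=np.float32)

def min_max_normalize(state):
    # Rescale every feature to [0, 1]
    return (state - STATE_MIN) / (STATE_MAX - STATE_MIN)

def z_score_normalize(state, mean, std):
    # mean and std would be estimated from states collected so far
    return (state - mean) / std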

What happens is that at the beginning, during the exploration phase, the score gets better over time, which means the model is learning something. But then, during the exploitation phase, when the epsilon value is low and the agent takes most of its actions greedily, the score decreases drastically, meaning that it actually didn't learn anything good.

This is my Deep Q-Learning algorithm:

epsilon = 1

for episode in range(1, EPISODES+1):
    episode_reward = 0
    step = 0
    done = False
    current_state = env_reset()

    while not done:
        normalized_current_state = env_normalize(current_state)

        if np_random_number() > epsilon:  # Take legal action greedily
            actionsQValues = agent_get_qs(normalized_current_state)
            legalActionsIds = env_get_legal_actions(current_state)
            # Make the argmax selection among the legal actions
            action = legalActionsIds[np_argmax(actionsQValues[legalActionsIds])]
        else:  # Take random legal action
            action = env_sample()

        new_state, reward, done = env_step(action)

        episode_reward += reward

        agent_update_replay_memory((normalized_current_state, action, reward, env_normalize(new_state), new_state, done))
        agent_train(done)

        current_state = new_state
        step += 1

    # Decay epsilon
    if epsilon > MIN_EPSILON:
        epsilon *= EPSILON_DECAY

where EPISODES = 4000 and EPSILON_DECAY = 0.9995.
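To get an idea of when the exploitation phase actually kicks in, here is a quick check of the decay schedule (assuming epsilon starts at 1 and MIN_EPSILON is never reached):

# Epsilon after n episodes with a multiplicative decay of 0.9995 per episode
for n in (1000, 2000, 4000):
    print(n, round(0.9995 ** n, 3))
# 1000 -> 0.606, 2000 -> 0.368, 4000 -> 0.135

So even after all 4000 episodes the agent is still taking a random action about 14% of the time.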

I played around with all these hyper-parameters, but the results are very similar. I don't know what else to try. Am I doing something wrong with normalization? Is there any other normalization method that is more recommended? Could the problem be that my neural network model is not good enough?

I think it shouldn't be this difficult to make it work for a problem as simple as this one, with an input size of 6, an output layer of 30 nodes, and a hidden layer of 32.

Note that for the same problem I also used a different state representation, a binary array of size 14, and it works fine with the same hyper-parameters. What might be the problem, then, when I use this other representation?

I discovered there was something wrong with the model implementation: the activation function of the output layer should not be softmax but linear. At least in my case, it works much better this way.
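This makes sense in hindsight: a softmax output forces the 30 outputs into [0,1] and makes them sum to 1, so the network can never fit Bellman targets like reward + DISCOUNT * max_next_q once they fall outside that range, whereas a linear output can regress arbitrary Q-values. A minimal sketch of the corrected model (same architecture as above, only the output activation changed):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(units=32, activation="relu", input_dim=self.env.state.size))
# Linear output so the network can regress unbounded Q-values
model.add(Dense(units=self.env.allActionsKeys.size, activation="linear"))

# 'accuracy' is dropped here since it isn't meaningful for Q-value regression
model.compile(loss="mse", optimizer=Adam(lr=0.00075))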
