Why does the score (cumulative reward) drop during the exploitation phase in this deep Q-learning model?

Question

I am having trouble getting a deep Q-learning agent to find the optimal policy. This is what my current model looks like in TensorFlow:

model = Sequential()

model.add(Dense(units=32, activation="relu", input_dim=self.env.state.size))
model.add(Dense(units=self.env.allActionsKeys.size, activation="softmax"))

model.compile(loss="mse", optimizer=Adam(lr=0.00075), metrics=['accuracy'])

For the problem I am currently working on, `self.env.state.size` equals 6 and the number of possible actions (`self.env.allActionsKeys.size`) is 30.

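The input vector consists of entries that each have a different range (although the ranges do not differ much in this problem): two entries take values in [0,3], two others in [0,2], and the remaining ones in [0,1]. Note that this is supposed to be a simple problem; I am also targeting more complex ones, for example with an input of size 15 whose ranges can differ more ([0,15], [0,3], ...).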

This is what my train method looks like:

def train(self, terminal_state):
    if len(self.replay_memory) < MIN_REPLAY_MEMORY_SIZE:
        return

    # Get MINIBATCH_SIZE random samples from replay_memory
    minibatch = random.sample(self.replay_memory, MINIBATCH_SIZE)

    # Transition: (current_state, action, reward, normalized_next_state, next_state, done)

    current_states = np.array([transition[0] for transition in minibatch])
    current_qs_minibatch = self.model.predict(current_states, batch_size=MINIBATCH_SIZE, use_multiprocessing=True)

    next_states = np.array([transition[3] for transition in minibatch])
    next_qs_minibatch = self.model.predict(next_states, batch_size=MINIBATCH_SIZE, use_multiprocessing=True)

    env_get_legal_actions = self.env.get_legal_actions
    np_max = np.max

    X = []
    y = []

    for index, (current_state, action, reward, normalized_next_state, next_state, done) in enumerate(minibatch):
        if not done:
            legalActionsIds = env_get_legal_actions(next_state)
            max_next_q = np_max(next_qs_minibatch[index][legalActionsIds])

            new_q = reward + DISCOUNT * max_next_q
        else:
            new_q = reward

        current_qs = current_qs_minibatch[index].copy()
        current_qs[action] = new_q

        X.append(current_state)
        y.append(current_qs)

    self.model.fit(np.array(X), np.array(y), batch_size=MINIBATCH_SIZE, verbose=0, shuffle=False)

where DISCOUNT = 0.99 and MINIBATCH_SIZE = 64.
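To make the target computation in `train` concrete, here is a minimal standalone sketch of the update for a single transition. The helper name `td_target` and all numbers except DISCOUNT = 0.99 are made up for illustration; they do not come from the question.

```python
DISCOUNT = 0.99  # value given in the question


def td_target(reward, next_qs, legal_action_ids, done):
    """Return the Q-learning target for one transition.

    For a terminal transition the target is the reward alone; otherwise it
    bootstraps from the best *legal* action in the next state.
    """
    if done:
        return reward
    max_next_q = max(next_qs[i] for i in legal_action_ids)
    return reward + DISCOUNT * max_next_q


# Hypothetical numbers: 4 actions, of which only actions 0 and 2 are legal.
next_qs = [1.5, 9.0, 2.0, -1.0]
target = td_target(reward=1.0, next_qs=next_qs, legal_action_ids=[0, 2], done=False)
print(target)  # 1.0 + 0.99 * 2.0, since the illegal action 1 is ignored
```

Note that a target built this way is not confined to any fixed interval: it scales with the rewards and can easily fall outside [0, 1].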

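I read that normalizing the input vector is recommended, so I tested two feature normalization methods: min-max normalization and z-score normalization. Since the value ranges do not differ much, I also tested without any normalization. None of these options proved better than the others.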
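For reference, the two normalization schemes mentioned here can be sketched as follows. This is only an illustration: the function names are hypothetical, the feature ranges are taken from the description above, and their ordering within the state vector is an assumption.

```python
import statistics

# Assumed per-feature ranges (two in [0, 3], two in [0, 2], two in [0, 1]);
# the order of the features inside the state vector is a guess.
FEATURE_RANGES = [(0, 3), (0, 3), (0, 2), (0, 2), (0, 1), (0, 1)]


def min_max_normalize(state, ranges=FEATURE_RANGES):
    """Scale each feature to [0, 1] using its known lower and upper bound."""
    return [(x - lo) / (hi - lo) for x, (lo, hi) in zip(state, ranges)]


def z_score_normalize(states):
    """Standardize each feature column to zero mean and unit variance.

    Statistics are estimated from the given batch of states. A column with
    zero spread is mapped to 0.0 to avoid dividing by zero.
    """
    columns = list(zip(*states))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in states
    ]


# Two made-up states of size 6.
states = [[3, 0, 2, 1, 0, 1], [1, 2, 0, 1, 1, 0]]
print(min_max_normalize(states[0]))
print(z_score_normalize(states))
```

Min-max scaling only needs the known bounds of each feature, whereas the z-score variant needs mean and spread estimates, which in practice would have to be gathered from observed states.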

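What happens is that at the beginning, during the exploration phase, the score improves over time, which suggests the model is learning something. But during the exploitation phase, when epsilon is low and the agent takes most of its actions greedily, the score drops sharply, which means it has not actually learned anything good.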

This is my deep Q-learning algorithm:

epsilon = 1

for episode in range(1, EPISODES+1):
    episode_reward = 0
    step = 0
    done = False
    current_state = env_reset()

    while not done:
        normalized_current_state = env_normalize(current_state)

        if np_random_number() > epsilon:  # Take legal action greedily
            actionsQValues = agent_get_qs(normalized_current_state)
            legalActionsIds = env_get_legal_actions(current_state)
            # Make the argmax selection among the legal actions
            action = legalActionsIds[np_argmax(actionsQValues[legalActionsIds])]
        else:  # Take random legal action
            action = env_sample()

        new_state, reward, done = env_step(action)

        episode_reward += reward

        agent_update_replay_memory((normalized_current_state, action, reward, env_normalize(new_state), new_state, done))
        agent_train(done)

        current_state = new_state
        step += 1

    # Decay epsilon
    if epsilon > MIN_EPSILON:
        epsilon *= EPSILON_DECAY

where EPISODES = 4000 and EPSILON_DECAY = 0.9995.
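As a quick sanity check on these values, the per-episode decay rule from the loop above can be replayed on its own. This is a sketch under one assumption: MIN_EPSILON is not given in the question, so the floor used here is a hypothetical placeholder.

```python
EPISODES = 4000          # from the question
EPSILON_DECAY = 0.9995   # from the question
MIN_EPSILON = 0.001      # hypothetical floor; the question does not state it


def epsilon_schedule(episodes, decay, min_epsilon, start=1.0):
    """Replay the loop's decay rule and return epsilon after each episode."""
    epsilon = start
    history = []
    for _ in range(episodes):
        if epsilon > min_epsilon:
            epsilon *= decay
        history.append(epsilon)
    return history


history = epsilon_schedule(EPISODES, EPSILON_DECAY, MIN_EPSILON)
print(f"epsilon after {EPISODES} episodes: {history[-1]:.4f}")
```

With this decay rate, epsilon equals 0.9995^4000, roughly 0.135, at the end of training (provided MIN_EPSILON is below that value), so the agent still takes a random action in about one step out of seven in the final episodes.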

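I have experimented with all of these hyperparameters, but the results are very similar. I do not know what else to try. Am I doing something wrong with the normalization? Is there another, more recommended normalization method? Could the problem be that my neural network model is not good enough?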

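I would think it should not be hard to get it to solve a problem as simple as this one, with an input of size 6, an output layer of 30 nodes, and a hidden layer of 32 nodes.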

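Note that for the same problem I have also used a different state representation, a binary array of size 14, and it works fine with the same hyperparameters. So what could be going wrong when I use this other representation?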

Answer 1

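I found that the problem was in the model implementation. The activation function of the output layer should be linear, not softmax. At least in my case, it works much better this way.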
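A likely reason for this, offered here as an explanation rather than something the answer states: a softmax output layer forces the predicted values into [0, 1] and makes them sum to 1, so the network cannot represent Q-value targets of the form reward + DISCOUNT * max Q that fall outside that range. A linear (identity) output has no such constraint; in Keras this corresponds to `activation="linear"` on the last `Dense` layer. The plain-Python sketch below illustrates the difference on made-up raw outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(logits):
    """Identity activation: the raw outputs are used directly as Q-values."""
    return list(logits)


# Hypothetical raw outputs of the last layer for 3 actions.
logits = [4.0, -2.0, 10.0]

soft_out = softmax(logits)
lin_out = linear(logits)

print(soft_out)  # every entry lies in [0, 1] and the entries sum to 1
print(lin_out)   # unconstrained values, suitable as return estimates
```

With the softmax head, a regression target such as 10.0 can never be matched no matter how the weights are trained, which is consistent with the behaviour described in the question.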
