
Deep Q-Network (DQN) to learn the game 2048 does not improve

I am trying to build a Deep Q-Network (DQN) agent that can learn to play the game 2048. I am basing my approach on other programs and articles built around the game Snake, where it worked well (specifically this one).

As the input state, I am only using the grid with the tiles as a numpy array, and as a reward I use (newScore - oldScore - 1) to penalize moves that do not give any points at all. I know this might not be optimal, since one might as well reward staying alive for as long as possible, but it should be okay as a first step, right? Nevertheless, I am not getting any good results whatsoever.
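As a sketch, that reward amounts to the following (the function and variable names are mine, not taken from the notebook):

    def compute_reward(old_score, new_score):
        # Points gained by the move, minus 1, so moves that merge nothing
        # (no points gained) receive a reward of -1.
        return new_score - old_score - 1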

I've tried tweaking the model layout, the number of neurons and layers, the optimizer, gamma, learning rates, rewards, etc. I also tried ending the game after 5 moves and optimizing just for those first five moves, but no matter what I do, I don't get any noticeable improvement. I've run it for thousands of games and it just doesn't get better. In fact, sometimes I get worse results than a completely random algorithm, because sometimes it just returns the same output for any input and gets stuck.

So, my question is: am I doing anything fundamentally wrong? Do I just have a small, stupid mistake somewhere? Is this the wrong approach entirely? (I know the game could probably be solved pretty easily without AI, but it seemed like a fun little project.)

My Jupyter notebook can be seen here on Github. Sorry for the poor code quality, I'm still a beginner and I know I need to start writing documentation even for fun little projects.

Some code snippets:

The input is formatted as a 1x16 numpy array. I also tried normalizing the values, or using only 1 and 0 for occupied and empty cells, but that did not help either, which is why I assume it may be more of a conceptual problem.

    def get_board(self):
        # Read the 4x4 grid of cells from the browser via Selenium.
        grid = self.driver.execute_script("return myGM.grid.cells;")
        mygrid = []
        for line in grid:
            # Use the tile value, or 0 for empty cells.
            a = [x['value'] if x is not None else 0 for x in line]
            #a = [1 if x is not None else 0 for x in line]  # binary occupied/empty encoding
            mygrid.append(a)
        return np.array(mygrid).reshape(1, 16)
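For the normalization attempt mentioned above, one common encoding is to feed log2 of the tile values so the inputs stay in a small range. This is only a sketch of that idea (my own code, not the notebook's):

    import numpy as np

    def encode_board(board):
        """Scale a 1x16 board of tile values (0, 2, 4, 8, ...) into roughly [0, 1].

        Empty cells stay 0; a 2048 tile maps to 11/16. This is a sketch of the
        normalization idea, not the notebook's implementation.
        """
        board = np.asarray(board, dtype=np.float32)
        out = np.zeros_like(board)
        mask = board > 0
        out[mask] = np.log2(board[mask]) / 16.0
        return out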

The output is an index in {0, 1, 2, 3}, representing the actions up, down, left or right, and it is simply the action with the highest prediction score.

    # Greedy action: pick the move with the highest predicted Q-value.
    prediction = agent.model.predict(old_state)
    predicted_move = np.argmax(prediction)

I've tried a lot of different model architectures, but have settled on a simpler network for now, as I have read that unnecessarily complex structures are often a problem and not needed. However, apart from experimenting, I couldn't find any reliable source on how to determine the optimal layout, so I'd be happy to get some more suggestions on that as well.

        model = models.Sequential()
        model.add(Dense(16, activation='relu', input_dim=16))
        #model.add(Dropout(0.15))
        #model.add(Dense(50, activation='relu'))
        #model.add(Dropout(0.15))
        model.add(Dense(20, activation='relu'))
        #model.add(Dropout(0.15))
        #model.add(Dense(30, input_dim=16, activation='relu'))
        #model.add(Dropout(0.15))
        #model.add(Dense(30, activation='relu'))
        #model.add(Dropout(0.15))
        #model.add(Dense(8, activation='relu'))
        #model.add(Dropout(0.15))
        model.add(Dense(4, activation='linear'))
        opt = Adam(lr=self.learning_rate)
        model.compile(loss='mse', optimizer=opt)

Hyperparameter tuning is a giant time-sinking rabbit hole that you should avoid. Make improvements elsewhere.

One suggestion I would recommend is to grab an off-the-shelf library and use its DQN implementation to test this 2048 environment. Then compare your benchmarks and isolate the trouble spots. It'll be easier for you to check whether the problem is your hyperparameters, rewards, model, memory size, etc.
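A minimal sketch of that workflow using stable-baselines3 (my choice of library, not one named in the answer); the 2048 environment wrapper itself is left as an assumption you would write yourself:

    # pip install stable-baselines3
    import gymnasium as gym
    from stable_baselines3 import DQN

    # Sanity-check the off-the-shelf DQN on a known-easy environment first;
    # then replace "CartPole-v1" with your own gymnasium.Env wrapper around 2048
    # (observation: the 1x16 board, action space: Discrete(4)).
    env = gym.make("CartPole-v1")

    model = DQN("MlpPolicy", env, learning_rate=1e-3, buffer_size=50_000, verbose=1)
    model.learn(total_timesteps=100_000)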

At a glance, here are some things that stood out: epsilon starts at 75 with a random range of 0-200. It's possible your agent isn't exploring enough. My understanding is that after fewer than 75 games your agent is purely exploiting, since you're not decaying your epsilon but subtracting 1 from it.
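For comparison, a common alternative is multiplicative epsilon decay, which keeps some exploration for much longer. A sketch with assumed hyperparameter values (not taken from the question or answer):

    import random
    import numpy as np

    class EpsilonGreedy:
        """Epsilon-greedy action selection with multiplicative decay
        instead of subtracting a constant each game."""
        def __init__(self, start=1.0, minimum=0.05, decay=0.995):
            self.epsilon, self.minimum, self.decay = start, minimum, decay

        def choose(self, q_values):
            if random.random() < self.epsilon:
                action = random.randrange(len(q_values))   # explore
            else:
                action = int(np.argmax(q_values))          # exploit
            self.epsilon = max(self.minimum, self.epsilon * self.decay)
            return action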

A reward of -10 gives me wonky behavior in some environments. Try -1.

len(memory) > 500: # Magic number -- Why 500?
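For reference, a minimal replay-memory sketch; the buffer size, batch size and the 500 threshold here are placeholders of mine, not values recommended by the answer:

    import random
    from collections import deque

    memory = deque(maxlen=10_000)   # experience replay buffer

    def remember(state, action, reward, next_state, done):
        memory.append((state, action, reward, next_state, done))

    def sample_batch(batch_size=64, min_size=500):
        # Only train once the buffer holds enough transitions; both numbers
        # are tunable and worth justifying rather than hard-coding.
        if len(memory) < min_size:
            return None
        return random.sample(memory, batch_size)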

Make sure you have a fixed seed when making comparisons.
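A sketch of what that could look like for this stack, assuming TensorFlow 2-style Keras:

    import random
    import numpy as np
    import tensorflow as tf

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)   # Keras weight init and any TF-side randomness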

What was the reasoning behind your layer sizes? Did you try 16, 16, 4 or 16, 8, 4? Did 16, 20, 4 give you a much better result?

The hardest part to read is not the code but the results you're getting. I'm having a hard time seeing how much reward your agent got and when it failed/passed, etc. Label your X and Y axes.
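For instance, a minimal labelled training curve could look like this; the scores list is a placeholder for whatever per-game totals you record:

    import matplotlib.pyplot as plt

    # scores: one total game score (or total reward) per finished game
    scores = [120, 256, 180, 404]   # example data; use your recorded results
    plt.plot(range(1, len(scores) + 1), scores)
    plt.xlabel("Game number")
    plt.ylabel("Total score per game")
    plt.title("DQN training progress on 2048")
    plt.show()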

Try training for more than 1 epoch.
