
Deep Q-Network (DQN) agent learning the game 2048 does not improve

I am trying to build a Deep Q-Network (DQN) agent that can learn to play the game 2048. I am basing my approach on other programs and articles that apply DQN to the game Snake, where it worked well (specifically this one).

As the input state, I am only using the grid of tiles as a numpy array, and as the reward I use (newScore - oldScore - 1) to penalize moves that do not give any points at all. I know this might not be optimal, since one might also reward staying alive for as long as possible, but it should be okay as a first step, right? Nevertheless, I am not getting any good results whatsoever.
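
For reference, a minimal sketch of that reward computation (the variable names are illustrative, not necessarily the ones in my notebook):

    # Sketch of the reward described above: the -1 penalizes moves that
    # do not merge any tiles and therefore leave the score unchanged.
    def compute_reward(old_score, new_score):
        return new_score - old_score - 1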

I've tried tweaking the model layout, the number of neurons and layers, the optimizer, gamma, learning rates, rewards, etc. I also tried ending the game after 5 moves and optimizing only for those first five moves, but no matter what I do, I don't get any noticeable improvement. I've run it for thousands of games and it just doesn't get better. In fact, I sometimes get worse results than a completely random algorithm, because the network sometimes returns the same output for any input and gets stuck.

So, my question is: am I doing anything fundamentally wrong? Do I just have a small, silly mistake somewhere? Is this completely the wrong approach? (I know the game could probably be solved fairly easily without AI, but it seemed like a fun little project.)

My Jupyter notebook can be seen here on GitHub. Sorry for the poor code quality; I'm still a beginner, and I know I need to start writing documentation even for fun little projects.

Some code snippets:

The input is formatted as a (1, 16) numpy array. I also tried normalizing the values or using only 1 and 0 for occupied and empty cells, but that did not help either, which is why I assume it may be more of a conceptual problem.

    def get_board(self):
        """Read the 4x4 grid from the browser and return it as a (1, 16) numpy array."""
        grid = self.driver.execute_script("return myGM.grid.cells;")
        mygrid = []
        for line in grid:
            # Use the tile value, or 0 for an empty cell.
            a = [x['value'] if x is not None else 0 for x in line]
            #a = [1 if x is not None else 0 for x in line]  # occupancy-only variant
            mygrid.append(a)
        return np.array(mygrid).reshape(1, 16)
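
For illustration, one possible normalization is a log2 scaling of the tile values; this is only a sketch of that variant, not necessarily the exact code in the notebook:

    import numpy as np

    def normalize_board(board):
        # Map tile values 0, 2, 4, 8, ... to 0, 1, 2, 3, ... via log2.
        # Empty cells stay 0 because log2(max(0, 1)) == 0.
        board = np.asarray(board, dtype=np.float32).reshape(1, 16)
        return np.log2(np.maximum(board, 1.0))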

The output is an index in {0, 1, 2, 3}, representing the actions up, down, left, and right; it is simply the action with the highest predicted Q-value.

prediction = agent.model.predict(old_state)  # estimated Q-values for the 4 actions
predicted_move = np.argmax(prediction)       # pick the action with the highest Q-value

I've tried a lot of different model architectures, but settled on a simpler network for now, as I have read that unnecessarily complex structures are often a problem and not needed. However, I couldn't find any reliable source for a method of finding the optimal layout other than experimenting, so I'd be happy to get more suggestions there.

    model = models.Sequential()
    model.add(Dense(16, activation='relu', input_dim=16))
    #model.add(Dropout(0.15))
    #model.add(Dense(50, activation='relu'))
    #model.add(Dropout(0.15))
    model.add(Dense(20, activation='relu'))
    #model.add(Dropout(0.15))
    #model.add(Dense(30, input_dim=16, activation='relu'))
    #model.add(Dropout(0.15))
    #model.add(Dense(30, activation='relu'))
    #model.add(Dropout(0.15))
    #model.add(Dense(8, activation='relu'))
    #model.add(Dropout(0.15))
    model.add(Dense(4, activation='linear'))
    opt = Adam(lr=self.learning_rate)  # 'learning_rate=' in newer Keras versions
    model.compile(loss='mse', optimizer=opt)

Hyperparameter tuning is a giant, time-sinking rabbit hole that you should avoid. Make improvements elsewhere.

One suggestion I would recommend is to grab an off-the-shelf library and use its DQN implementation to test this 2048 environment. Then compare your benchmarks and isolate the trouble spots. It will be easier for you to check whether the problem is your hyperparameters, rewards, model, memory size, etc.
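
As a rough sketch of what I mean, a library like stable-baselines3 lets you benchmark a reference DQN in a few lines. `Game2048Env` below is a hypothetical Gym-style wrapper around the game that you would have to provide yourself:

    from stable_baselines3 import DQN

    # Game2048Env is a placeholder for a Gym-style 2048 environment that exposes
    # the 16-cell board as the observation and the 4 moves as the action space.
    env = Game2048Env()

    model = DQN("MlpPolicy", env, learning_rate=1e-3, buffer_size=50_000, verbose=1)
    model.learn(total_timesteps=100_000)

Comparing its learning curve against your own implementation on the same environment makes it much easier to tell whether the problem lies in the environment/rewards or in your agent code.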

At a glance, here are some things that stood out:

Epsilon starts at 75, with a random range of 0-200. It's possible your agent isn't exploring enough. My understanding is that within fewer than 75 games your agent ends up only exploiting, since you're not decaying your epsilon but subtracting 1 from it each game.
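
A common alternative is epsilon-greedy selection with a multiplicative decay and a floor, roughly like this (the values are illustrative, not tuned for your setup):

    import random
    import numpy as np

    epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995  # illustrative values

    def choose_action(q_values):
        """Epsilon-greedy choice with multiplicative decay instead of 'epsilon -= 1'."""
        global epsilon
        if random.random() < epsilon:
            action = random.randrange(4)        # explore: random move
        else:
            action = int(np.argmax(q_values))   # exploit: greedy move
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        return action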

A reward of -10 gives me wonky behavior on some environments. Try -1.

len(memory) > 500: # Magic number -- Why 500?

Make sure you have a fixed seed when making comparisons.
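
For example, fix the seed of every source of randomness you use (TensorFlow/Keras assumed here; adjust for your setup):

    import random
    import numpy as np
    import tensorflow as tf

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    tf.random.set_seed(SEED)  # tf.set_random_seed(SEED) on TensorFlow 1.x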

What was the reasoning behind your layer sizes? Did you try 16, 16, 4 or 16, 8, 4? Did 16, 20, 4 give you a much better result?

The hardest part to read is not the code but the results you're getting. I'm having a hard time seeing how much reward your agent got and when it failed or passed, etc. Label your X and Y axes.
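
Something as simple as this makes the learning curve far easier to read (`scores` stands for a list holding the final score of each game):

    import matplotlib.pyplot as plt

    plt.plot(scores)                  # scores: one final score per game (assumed)
    plt.xlabel("Game number")
    plt.ylabel("Final score")
    plt.title("DQN agent on 2048")
    plt.show()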

Try training for more than 1 epoch.
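
That is, in the replay/training step, something along these lines (`states` and `targets` stand for the usual batch of inputs and target Q-values):

    # Fit each batch for a few epochs instead of a single pass.
    model.fit(states, targets, epochs=3, verbose=0)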
