
Huge number of states in a Q-learning calculation

I implemented a 3x3 OX (tic-tac-toe) game with Q-learning (it works perfectly in AI vs. AI and AI vs. human), but I can't go one step further to a 4x4 OX game, since it eats up all my PC's memory and crashes.

Here is my current problem: Access violation in huge array?

In my understanding, a 3x3 OX game has a total of 3^9 = 19,683 possible states (each cell is one of three values: empty, white, or black; the same pattern at a different angle still counts separately).

For a 4x4 OX game, the total number of states is 3^16 = 43,046,721.

For a regular Gomoku game on a 15x15 board, the total number of states is 3^225 ≈ 2.25 x 10^107.
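These upper bounds are easy to verify, since Python integers have arbitrary precision; a quick sketch:

```python
# Upper bound on raw board configurations for a k x k board:
# each of the k*k cells holds one of 3 values (empty, white, black),
# ignoring reachability and symmetry.
for k in (3, 4, 15):
    print(f"{k}x{k}: 3^{k*k} = {3 ** (k * k)}")
# 3x3  -> 19683
# 4x4  -> 43046721
# 15x15 -> a 108-digit number
```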

Q1. Is my calculation correct? (For a 4x4 OX game, do I need an array of size 3^16?)

Q2. Since I need to calculate a Q-value for each (state, action) pair, I need an array this large. Is that expected? Is there any way to avoid it?

Consider symmetries. The actual number of possible configurations is much smaller than 3^9 on a 3x3 board. For example, there are essentially only 3 different configurations with a single x on the board.

Rotations

There are many board configurations that should all result in the same decision made by your AI, because they are the same modulo a symmetry. For example:

x - -    - - x    - - -    - - -  
- - -    - - -    - - -    - - - 
- - -    - - -    - - x    x - - 

Those are all the same configuration. If you treat them individually, you waste training time.

Mirroring

There is not only rotational symmetry; you can also mirror the board without changing the actual situation. The following are basically all the same:

0 - x    x - 0    - - -    - - -  
- - -    - - -    - - -    - - - 
- - -    - - -    0 - x    x - 0
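One common way to exploit both symmetries is to map every board to a single canonical representative: generate all 8 rotations and reflections and pick, say, the lexicographically smallest. A minimal sketch for a 3x3 board stored as a 9-character row-major string (the function names are mine, not from the answer):

```python
def rotate(b):
    """Rotate a 3x3 board (9-char string, row-major) 90 degrees clockwise."""
    return ''.join(b[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))

def mirror(b):
    """Mirror a 3x3 board left-right."""
    return ''.join(b[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))

def canonical(b):
    """Lexicographically smallest of the 8 symmetric variants of b."""
    forms = []
    for _ in range(4):
        forms.append(b)
        forms.append(mirror(b))
        b = rotate(b)
    return min(forms)
```

With this, all four single-x corner boards shown above map to the same representative, so the Q-table needs only one entry for the whole orbit.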

Exclude "cannot happen" configurations

Next, consider that the game ends when one player wins. For example, there are 3^3 = 27 configurations that all look like this:

x 0 ?
x 0 ?    // cannot happen
x 0 ?

They can never appear in a normal match. You don't have to reserve space for them, because they simply cannot happen.

Exclude more "cannot happen"

Moreover, you vastly overestimate the size of the configuration space with 3^9, because players alternate turns. For example, you can never reach a configuration like this:

x x -
x - -    // cannot happen
- - - 
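A cheap filter for both kinds of impossible configurations checks the piece counts implied by alternating turns and rejects boards where both players have a completed line. This is a sketch under my own conventions (x moves first, board as a 9-character string; `plausible` is my name for it):

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def has_line(b, p):
    return any(b[i] == b[j] == b[k] == p for i, j, k in LINES)

def plausible(b):
    """Can board b (9-char string, x moves first) occur in a real game?"""
    x, o = b.count('x'), b.count('o')
    if x not in (o, o + 1):                  # players alternate, x first
        return False
    if has_line(b, 'x') and has_line(b, 'o'):
        return False                         # game would have ended earlier
    if has_line(b, 'x') and x != o + 1:
        return False                         # x's winning move must be x's last
    if has_line(b, 'o') and x != o:
        return False
    return True
```

The three-x board above fails the count check, and the "x 0 ? / x 0 ? / x 0 ?" family fails the double-winner check.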

How to get all needed configurations?

In a nutshell, this is how I would approach the problem:

  • define an operator< for your board
  • using the < relation, select one representative for each set of "similar" configurations (e.g. the one that is < all others in the set)
  • write a function that, for a given configuration, returns the representative configuration
  • brute-force iterate over all possible moves (only possible moves! i.e. make alternating turns, stopping once the game is won). While doing that you
    • calculate the representative for each configuration you encounter
    • remember all representative configurations (note that they appear several times, because of symmetry)

You now have a list of all configurations modulo symmetry. During the actual game you just have to transform the board to its representative and then make the move. You can transform back to the actual configuration afterwards, if you remember how to rotate/mirror it back.

This is rather brute force. My maths is a bit rusty, otherwise I would try to derive the list of representatives directly. However, this is something you only have to do once for each size of the board.
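The recipe above can be written down directly for the 3x3 case. This is my own sketch of it, using the lexicographic minimum over the 8 rotations/reflections as the representative and expanding each state only once:

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != '.' and b[i] == b[j] == b[k]:
            return b[i]
    return None

def canonical(b):
    rot = lambda s: ''.join(s[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))
    mir = lambda s: ''.join(s[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))
    forms = []
    for _ in range(4):
        forms += [b, mir(b)]
        b = rot(b)
    return min(forms)

def reachable(key):
    """All states reachable from the empty board by alternating play,
    deduplicated by `key` (identity, or `canonical` for symmetry reduction)."""
    seen = set()
    def walk(b, player):
        k = key(b)
        if k in seen:
            return
        seen.add(k)
        if winner(b) or '.' not in b:   # game over: do not expand further
            return
        nxt = 'o' if player == 'x' else 'x'
        for i, c in enumerate(b):
            if c == '.':
                walk(b[:i] + player + b[i + 1:], nxt)
    walk('.' * 9, 'x')
    return seen

print(len(reachable(lambda b: b)))   # 5478 reachable states
print(len(reachable(canonical)))     # 765 states modulo symmetry
```

So symmetry plus reachability shrinks the naive 19,683 down to 765 states, a factor of roughly 25, and the same idea helps proportionally more on bigger boards.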

If you would rather not reinvent the wheel, here is what others have done to solve this problem:

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm.

https://arxiv.org/pdf/1312.5602v1.pdf

We could represent our Q-function with a neural network, that takes the state (four game screens) and action as input and outputs the corresponding Q-value. Alternatively we could take only game screens as input and output the Q-value for each possible action. This approach has the advantage, that if we want to perform a Q-value update or pick the action with highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.

https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
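The interface described in that quote (state in, one Q-value per action out, all in one forward pass) does not require a deep network to demonstrate. Here is a minimal linear approximator in plain Python as an illustration of the contract (my own sketch; class and method names are made up):

```python
class LinearQ:
    """Q(s, .) as one forward pass: feature vector in, one value per action out."""

    def __init__(self, n_features, n_actions, lr=0.1):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr

    def q_values(self, phi):
        """Return [Q(s, a0), Q(s, a1), ...] for feature vector phi."""
        return [sum(w * f for w, f in zip(row, phi)) for row in self.w]

    def update(self, phi, action, target):
        """One Q-learning step: nudge Q(s, a) toward
        target = r + gamma * max_a' Q(s', a')."""
        err = target - self.q_values(phi)[action]
        for j, f in enumerate(phi):
            self.w[action][j] += self.lr * err * f
```

A DQN replaces this linear map with a convolutional network, but the input/output contract, and the reason updates and greedy action selection are cheap, is exactly the same.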

I have an enumeration scheme, but it requires an array of integers. If you can compress an array of integers into a single key (and back), then this might work.

First comes N, number of pieces on the board.

Then comes an array of ceil(N/2) items for the X pieces. Each number is the count of empty valid spaces since the previous X piece (or since the board start). IMPORTANT: a space is not valid if placing a piece there would end the game. This is where the five-in-a-row ending rule helps us reduce the domain.

Then comes an array of floor(N/2) items for the O pieces. The same logic applies as for the X array.

So for this board and 3 piece rule:

XX.
X.O
..O

we have the following array:

N: 5
X: 0 (from board start), 0 (from previous X), 0 (top right corner is invalid for X because it would end the game)
O: 2 (from board start, minus all preceding X), 2 (from previous O)

and that's the array [5, 0, 0, 0, 2, 2]. Given this array, we can recreate the board above. Small numbers occur more often than big ones. In a regular game on a 19x19 board, the pieces mostly group together, so there will be a lot of zeros, ones, and twos, delimited by an occasional "big" number for the next line.
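A minimal sketch of this encoder for the 3x3 board and three-in-a-row rule (my own reconstruction of the scheme: I treat a gap cell as invalid if a piece there, together with the same-colour pieces already enumerated, would complete a line):

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def encode(board):
    """board: 9-char row-major string of 'X', 'O', '.'.
    Returns [N, X-gaps..., O-gaps...] as described above."""
    out = [sum(c != '.' for c in board)]          # N, number of pieces
    for piece in ('X', 'O'):
        placed = set()                            # same-colour pieces seen so far
        prev = -1
        for p in (i for i, c in enumerate(board) if c == piece):
            gap = 0
            for cell in range(prev + 1, p):
                if board[cell] != '.':
                    continue                      # occupied cells are skipped, not counted
                if any(set(line) <= placed | {cell} for line in LINES):
                    continue                      # placing here would end the game: invalid
                gap += 1
            out.append(gap)
            placed.add(p)
            prev = p
    return out
```

For the example board `"XX.X.O..O"` this reproduces `[5, 0, 0, 0, 2, 2]`: the top-right corner contributes no gap for the third X because an X there would complete the top row.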

You now have to compress this array, exploiting the fact that small numbers occur more often than big ones. A general-purpose compression algorithm may help, but a specialized one may help more.

I don't know anything about Q-learning, but all of this requires that the encoded state can have a variable size. If it must have a constant size, that size would have to accommodate the worst possible board, and it may be so big that it defeats the purpose of this enumeration/compression in the first place.

We use a left-to-right, top-to-bottom order to enumerate the pieces, but a spiraling order might yield an even better small-to-big-number ratio; we would just have to pick the best starting point for the spiral center. But this may complicate the algorithm and waste more CPU time in the end.

Also, we don't really need the first number in the array, N; the length of the array gives the same information.
