Huge number of states in Q-learning calculation

Question

I implemented a 3x3 OX game with Q-learning (it works perfectly in AI vs AI and AI vs human), but I can't go one step further to a 4x4 OX game, since it eats up all my PC's memory and crashes.

Here is my current problem: Access violation in huge array?

In my understanding, a 3x3 OX game has a total of 3 (space, white, black) ^ 9 = 19683 possible states (the same pattern at a different angle still counts as a separate state).

For a 4x4 OX game, the total number of states is 3 ^ 16 = 43,046,721.

For a regular go game on a 15x15 board, the total number of states is 3 ^ 225 ~ 2.25 x 10^107.

Q1. I want to know whether my calculation is correct (for a 4x4 OX game, do I need an array of size 3^16?).

Q2. Since I need to calculate each Q value (for each state and each action), I need such a large array. Is this expected? Is there any way to avoid it?

Answer 1

Consider symmetries. The actual number of possible configurations on a 3x3 board is much smaller than 3^9. For example, there are basically only 3 different configurations with a single x on the board (corner, edge, center).

Rotations

There are many board configurations that should all result in the same decision by your AI, because they are the same modulo a symmetry. For example:

x - -    - - x    - - -    - - -  
- - -    - - -    - - -    - - - 
- - -    - - -    - - x    x - -

Those are all the same configuration. If you treat them individually, you waste training time.

Mirroring

There is not only rotational symmetry; you can also mirror the board without changing the actual situation. The following are basically all the same:

0 - x    x - 0    - - -    - - -  
- - -    - - -    - - -    - - - 
- - -    - - -    0 - x    x - 0
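The rotation and mirroring idea can be sketched in a few lines. This is a minimal illustration, not the answerer's code: it assumes a board is stored as a row-major string using the characters `-`, `x` and `0` from the diagrams, and the helper names `rotate`, `mirror`, `symmetries` and `canonical` are made up for this example.

```python
def rotate(board, n):
    """Rotate a row-major n*n board string by 90 degrees clockwise."""
    return "".join(board[(n - 1 - c) * n + r] for r in range(n) for c in range(n))


def mirror(board, n):
    """Flip a row-major n*n board string left to right."""
    return "".join(board[r * n + (n - 1 - c)] for r in range(n) for c in range(n))


def symmetries(board, n):
    """Return the 8 variants of a board: 4 rotations, each with its mirror image."""
    variants = []
    for _ in range(4):
        variants.append(board)
        variants.append(mirror(board, n))
        board = rotate(board, n)
    return variants


def canonical(board, n):
    """Pick one fixed member of the symmetry class (the smallest string)."""
    return min(symmetries(board, n))


corners = ["x--------", "--x------", "--------x", "------x--"]
mirrored = ["0-x------", "x-0------", "------0-x", "------x-0"]
print({canonical(b, 3) for b in corners})   # a single element
print({canonical(b, 3) for b in mirrored})  # a single element
```

Each of the two printed sets contains exactly one board, so a table keyed by `canonical(board, 3)` stores one entry where a naive table stores up to eight.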

Exclude "cannot happen" configurations

Next, consider that the game ends when one player wins. For example, you have 3^3 configurations that all look like this:

x 0 ?
x 0 ?    // cannot happen
x 0 ?

They can never appear in a normal match. You don't have to reserve space for them, because they simply cannot happen.

Exclude more "cannot happen"

Moreover, you vastly overestimate the size of the configuration space with 3^9, because the players take alternating turns. As an example, you cannot reach a configuration like this:

x x -
x - -    // cannot happen
- - -
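Both "cannot happen" rules can be checked mechanically. The sketch below is an illustration under assumptions that the answer does not state explicitly: `x` moves first, boards are row-major strings over `-`, `x`, `0`, and the function name `is_plausible` is hypothetical. It tests necessary conditions only (piece counts consistent with alternating turns, and no play after a win).

```python
# All eight winning lines of a 3x3 board, as row-major cell indices.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def wins(board, player):
    """True if `player` occupies every cell of at least one line."""
    return any(all(board[i] == player for i in line) for line in LINES)


def is_plausible(board):
    """Reject boards that alternating play (x first) can never produce."""
    nx, no = board.count("x"), board.count("0")
    if nx - no not in (0, 1):          # turns alternate, x starts
        return False
    x_won, o_won = wins(board, "x"), wins(board, "0")
    if x_won and o_won:                # the game stops at the first win
        return False
    if x_won and nx != no + 1:         # x must have made the last move
        return False
    if o_won and nx != no:             # 0 must have made the last move
        return False
    return True


print(is_plausible("xx-x-----"))  # three x and no 0
print(is_plausible("x0-x0-x0-"))  # both players have a full column
print(is_plausible("x0-------"))  # an ordinary position after two moves
```

The first two calls print `False` and the last prints `True`.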

How to get all needed configurations?

In a nutshell, this is how I would approach the problem:

1. Define an operator< for your board.
2. Using the < relation, you can select one representative for each set of "similar" configurations (e.g. the one that is < all others in the set).
3. Write a function that, for a given configuration, returns the representative configuration.
4. Brute-force iterate over all possible moves (only possible moves, i.e. by making alternating turns of the players only until the game is won). While doing that, you
   - calculate the representative for each configuration you encounter
   - remember all representative configurations (note that they appear several times, because of symmetry)

You now have a list of all configurations modulo symmetry. During the actual game you just have to transform the board to its representative and then make the move. You can transform back to the actual configuration afterwards, if you remember how to rotate/mirror it back.
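The brute-force enumeration described above might look like the following for the 3x3 board. This is a sketch under assumptions: Python's built-in string comparison plays the role of operator<, `x` moves first, and the names `representative`, `game_over` and `enumerate_boards` are invented for the example.

```python
N = 3
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def rotate(b):
    """Rotate a row-major board string by 90 degrees clockwise."""
    return "".join(b[(N - 1 - c) * N + r] for r in range(N) for c in range(N))


def mirror(b):
    """Flip a row-major board string left to right."""
    return "".join(b[r * N + (N - 1 - c)] for r in range(N) for c in range(N))


def representative(b):
    """Smallest of the 8 symmetric variants under string comparison."""
    variants = []
    for _ in range(4):
        variants += [b, mirror(b)]
        b = rotate(b)
    return min(variants)


def game_over(b):
    """True once a line is complete or the board is full."""
    won = any(b[i] != "-" and b[i] == b[j] == b[k] for i, j, k in LINES)
    return won or "-" not in b


def enumerate_boards():
    """Play out every legal game and collect boards and their representatives."""
    seen, reps = set(), set()
    stack = ["-" * (N * N)]
    while stack:
        b = stack.pop()
        if b in seen:
            continue
        seen.add(b)
        reps.add(representative(b))
        if game_over(b):
            continue
        player = "x" if b.count("x") == b.count("0") else "0"
        stack.extend(b[:i] + player + b[i + 1:] for i, c in enumerate(b) if c == "-")
    return seen, reps


seen, reps = enumerate_boards()
print(len(seen), len(reps))
```

For 3x3 this reports 5478 reachable boards and 765 representatives, compared with 19683 entries in the naive table. The same loop works for a larger board once `N` and `LINES` are adapted, although the run time grows quickly.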

This is rather brute force. My maths is a bit rusty, otherwise I would try to get the list of representatives directly. However, this is something you only have to do once for each board size.

Answer 2

If you skip reinventing the wheel, here is what has been done to solve this problem:

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm.

https://arxiv.org/pdf/1312.5602v1.pdf

We could represent our Q-function with a neural network, that takes the state (four game screens) and action as input and outputs the corresponding Q-value. Alternatively we could take only game screens as input and output the Q-value for each possible action. This approach has the advantage, that if we want to perform a Q-value update or pick the action with highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.
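To make the "state in, one Q-value per action out" idea concrete for the 4x4 board, here is a deliberately small sketch. It is not the network from the paper: a plain linear model stands in for the convolutional network, the one-hot board encoding is an assumption, and the class name `LinearQ` and its methods are hypothetical. The update is the usual semi-gradient Q-learning step.

```python
import random

N_CELLS = 16        # 4x4 board, one action per cell
SYMBOLS = "-x0"     # empty, x, 0


def features(board):
    """One-hot encode each cell, giving 16 * 3 = 48 inputs."""
    phi = []
    for cell in board:
        phi.extend(1.0 if cell == s else 0.0 for s in SYMBOLS)
    return phi


class LinearQ:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        n_in = N_CELLS * len(SYMBOLS)
        # One weight row per action: 16 * 48 = 768 numbers in total.
        self.w = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)]
                  for _ in range(N_CELLS)]

    def q_values(self, board):
        """A single forward pass yields the Q-value of every action."""
        phi = features(board)
        return [sum(wi * xi for wi, xi in zip(row, phi)) for row in self.w]

    def best_action(self, board):
        """Greedy choice restricted to empty cells."""
        q = self.q_values(board)
        legal = [a for a, c in enumerate(board) if c == "-"]
        return max(legal, key=lambda a: q[a])

    def update(self, board, action, reward, next_board, done,
               alpha=0.01, gamma=0.9):
        """Move Q(board, action) towards reward + gamma * max Q(next_board, .)."""
        target = reward
        if not done:
            nq = self.q_values(next_board)
            legal = [a for a, c in enumerate(next_board) if c == "-"]
            if legal:
                target += gamma * max(nq[a] for a in legal)
        error = target - self.q_values(board)[action]
        phi = features(board)
        self.w[action] = [wi + alpha * error * xi
                          for wi, xi in zip(self.w[action], phi)]
        return error


model = LinearQ()
empty = "-" * N_CELLS
before = model.q_values(empty)[5]
model.update(empty, 5, reward=1.0, next_board=empty, done=True)
after = model.q_values(empty)[5]
print(before, after)
```

The model holds 768 weights regardless of how many states exist, which is the point of replacing the table with a function. A linear model is far weaker than the network in the quoted paper, so treat this only as a picture of the interface.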

https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/

Answer 3

I have an enumeration scheme, but it requires an array of integers. If you can compress an array of integers to a single Q value (and back), then this might work.

First comes N, the number of pieces on the board.

Then comes the array of ceil(N/2) items, the X pieces. Each number is the count of empty valid spaces since the previous X piece (or the board start). IMPORTANT: a space is not valid if playing there would end the game. This is where the 5-in-a-row end rule helps us reduce the domain.

Then comes the array of floor(N/2) items, the O pieces. The same logic applies as for the X array.

So for this board and a 3-piece rule:

XX.
X.O
..O

we have the following array:

N: 5
X: 0 (from board start), 0 (from previous X), 0 (top right corner is invalid for X because it would end the game)
O: 2 (from board start, minus all preceding X), 2 (from previous O)

and that's the array [5, 0, 0, 0, 2, 2]. Given this array we can recreate the board above. Small numbers are more probable than big numbers. In a regular game on a 19x19 board the pieces will mostly group together, so there will be a lot of zeros, ones, and twos, delimited by an occasional "big" number for the next line.
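One possible reading of this scheme, for the 3x3 board with the 3-piece rule, is sketched below. Several details are assumptions rather than something the answer spells out: a cell counts as invalid when the piece being placed would complete a line with same-coloured pieces earlier in scan order, the X pass ignores where the O pieces are, and the O pass skips every X cell. The function names are hypothetical.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def completes_line(cell, placed):
    """Would a piece on `cell` finish a line together with `placed` cells?"""
    return any(cell in line and all(i in placed for i in line if i != cell)
               for line in LINES)


def gaps(board, piece, blocked):
    """Count valid free cells before each `piece`, scanning row by row."""
    out, count, placed = [], 0, set()
    for cell, c in enumerate(board):
        if cell in blocked:
            continue
        valid = not completes_line(cell, placed)
        if c == piece:
            if not valid:
                raise ValueError("finished games are not representable")
            out.append(count)
            count = 0
            placed.add(cell)
        elif valid:
            count += 1
    return out


def encode(board):
    xs = {i for i, c in enumerate(board) if c == "X"}
    x_gaps = gaps(board, "X", set())
    o_gaps = gaps(board, "O", xs)
    return [len(x_gaps) + len(o_gaps)] + x_gaps + o_gaps


def place(cells, gap_list, piece, blocked):
    """Inverse of `gaps`: skip the counted cells, then drop a piece."""
    placed, it = set(), iter(gap_list)
    remaining = next(it, None)
    for cell in range(len(cells)):
        if remaining is None:
            break
        if cell in blocked or completes_line(cell, placed):
            continue
        if remaining == 0:
            cells[cell] = piece
            placed.add(cell)
            remaining = next(it, None)
        else:
            remaining -= 1
    return placed


def decode(code):
    n_x = (code[0] + 1) // 2            # ceil(N / 2) X pieces come first
    cells = ["."] * 9
    xs = place(cells, code[1:1 + n_x], "X", set())
    place(cells, code[1 + n_x:], "O", xs)
    return "".join(cells)


board = "XX." "X.O" "..O"
print(encode(board))                     # [5, 0, 0, 0, 2, 2]
print(decode(encode(board)) == board)    # True
```

Under these assumptions the example board produces the array from the answer and decodes back to the same board.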

You now have to compress this array using the fact that small numbers occur more often than big ones. A general-purpose compression algorithm may help, but a specialized one may help more.
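As one example of a code that favours small numbers, Elias gamma coding gives short bit strings to small values. The answer does not name a specific algorithm, so this choice is an assumption made purely for illustration; the bit string is kept as a Python `str` of `0` and `1` for readability.

```python
def gamma_encode(values):
    """Elias gamma code of v + 1 for each non-negative integer v."""
    bits = []
    for v in values:
        binary = bin(v + 1)[2:]                  # shift by one so 0 is encodable
        bits.append("0" * (len(binary) - 1) + binary)
    return "".join(bits)


def gamma_decode(bits):
    """Inverse of gamma_encode."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                    # leading zeros give the length
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2) - 1)
        i += zeros + 1
    return values


code = [5, 0, 0, 0, 2, 2]
packed = gamma_encode(code)
print(packed, len(packed))     # 00110111011011 14
print(gamma_decode(packed))    # [5, 0, 0, 0, 2, 2]
```

Each 0 costs one bit and each 2 costs three bits, so the six-element example fits in 14 bits.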

I don't know anything about q-learning, but all of this requires that a q-value can have variable size. If the q-value must have a constant size, then that size has to account for the worst possible board, and it may be so big that it defeats the purpose of having this enumeration/compression in the first place.
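One way around the variable-size concern, offered as an assumption about how the encoding would be used: in tabular Q-learning the encoded board would serve as the lookup key rather than as the Q-value itself, and a hash map accepts keys of any length. The sketch below stores rows only for states that were actually visited; the class name `SparseQTable` is hypothetical, and the update is the standard Q-learning rule.

```python
class SparseQTable:
    """Q-table that allocates a row only when a state is first touched."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.table = {}                      # key: tuple of ints, any length

    def q(self, key):
        """Return the row for `key`, creating a zero row on first use."""
        return self.table.setdefault(key, [0.0] * self.n_actions)

    def update(self, key, action, reward, next_key=None):
        """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

        Pass next_key=None for a terminal transition. For brevity the max
        runs over all actions, including cells that are already occupied.
        """
        best_next = 0.0 if next_key is None else max(self.q(next_key))
        row = self.q(key)
        row[action] += self.alpha * (reward + self.gamma * best_next - row[action])


q = SparseQTable(n_actions=9)
state = (5, 0, 0, 0, 2, 2)                   # the example array as a tuple
q.update(state, action=2, reward=1.0)
print(q.q(state)[2], len(q.table))
```

Memory then grows with the number of states seen during training instead of with 3^16.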

We use a left-to-right, top-to-bottom order to enumerate pieces, but we could also use a spiral order, which may yield an even better ratio of small to big numbers. We would just have to pick the best starting point for the spiral center. But this may complicate the algorithm and waste more CPU time in the end.

Also, we don't really need the first number in the array, N. The length of the array gives this information.
