
Monte Carlo Tree Search in board games - How to Implement Opponent Moves

I am working on an implementation of the MCTS algorithm, in the context of zero-sum board games with perfect information, e.g. Chess, Go, Checkers.

As I understand it, each iteration of the algorithm has four steps: selection, expansion, simulation, and backpropagation.
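For concreteness, here is roughly how I currently picture one iteration, as a minimal Python sketch. The node fields and the game-state interface (`copy`, `legal_moves`, `play`, `current_player`, `is_terminal`, `winner`) are just my own placeholder names, not from any particular library, and the simulation result is credited the same way at every node, which is exactly where my question comes in.

```python
import math
import random

def run_iteration(root, root_state, c=1.41):
    node, state = root, root_state.copy()

    # 1. Selection: descend with UCT while the current node is fully expanded.
    while not node.untried_moves and node.children:
        node = max(node.children,
                   key=lambda ch: ch.value / ch.visits
                                  + c * math.sqrt(math.log(node.visits) / ch.visits))
        state.play(node.move)

    # 2. Expansion: add one child for a move not yet tried from this node.
    if node.untried_moves:
        move = node.untried_moves.pop()
        state.play(move)
        node = node.add_child(move, player_to_move=state.current_player())
        node.untried_moves = list(state.legal_moves())

    # 3. Simulation: play random moves until the game ends.
    while not state.is_terminal():
        state.play(random.choice(state.legal_moves()))

    # 4. Backpropagation: push the result back up to the root.
    result = state.winner()   # e.g. +1 if Black won, -1 if White won, 0 for a draw
    while node is not None:
        node.visits += 1
        node.value += result  # crediting Black and White nodes identically -- see my question below
        node = node.parent
```

(The root node's `untried_moves` would be initialised to the legal moves from the root position before the first iteration.)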

My question is about the implementation of opponent moves: how they should be represented in the tree, and how they should be handled at each of these stages.

For example, let's imagine a game of Go, where we (Black) are playing against an AI (White). When Black makes an action a_b from the root node s_0, it is then White's turn to make an action a_w.

My initial thought was that each action would produce a new state, so s_0 -> a_b -> s_1 -> a_w -> s_2, where each state s would be represented by a node. However, this would affect the selection process in MCTS. In this case, wouldn't MCTS have a tendency to explore bad a_w moves, since those return better rewards for Black?

The alternative solution I thought of was to combine the two actions into a single node, so s_0 -> a_b -> a_w -> s_1. However, this would make the decision making more difficult, since each root-level action is now associated with multiple different nodes.
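To make the first option concrete, this is roughly the node structure I have in mind: one node per state, with Black and White nodes alternating by depth, and each node recording whose turn it is. The field names are just my own sketch, not from any framework.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    player_to_move: str                # "black" or "white": who acts from this state
    move: object = None                # the action that led from the parent to this node
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    untried_moves: list = field(default_factory=list)  # legal moves not yet expanded
    visits: int = 0
    value: float = 0.0                 # accumulated simulation results for this node

    def add_child(self, move, player_to_move):
        child = Node(player_to_move=player_to_move, move=move, parent=self)
        self.children.append(child)
        return child
```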

Is there any framework which suggests how opponents should be represented in MCTS? Any help would be appreciated.

Edit 1: Since we will be playing Black in the example above, the reward function at the end of each simulation is with respect to Black. E.g. if Black wins at the end of the game, the reward will be backed up through all nodes, both Black and White nodes. My expectation is that the White nodes which allowed Black to win would then have high state values.

But maybe I should flip the reward when doing backpropagation? E.g. if Black wins, it's +1 for Black nodes and -1 for White nodes. This way, the selection function stays the same. Would this be correct?
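Concretely, what I have in mind is something like this rough sketch, where each node is credited from the point of view of the player whose move led to it (so that a parent can compare its children directly):

```python
def backpropagate(node, result):
    """result: +1 if Black won the simulation, -1 if White won, 0 for a draw."""
    while node is not None:
        node.visits += 1
        # The move into this node was made by the opponent of player_to_move,
        # so credit the node from that player's perspective.
        node.value += result if node.player_to_move == "white" else -result
        node = node.parent
```

With that convention, every child's value is already from the perspective of the player choosing among the children, so the same UCT selection rule could be applied at every level of the tree.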

You should run either against a known strong opponent or against the algorithm itself.

Assuming you run against your own algorithm, feed the data into it to figure out the "best" move. Make sure the algorithm works for the intended side (i.e. if you play Go/chess, the easiest thing is to swap the colors of the game pieces).

If you play against yourself, you basically generate twice as many data points for learning the game.

If you are just starting out, it might be worth playing against some other machine player. You don't get as many data points, but the ones you get teach you more (i.e. a bad move will be learned faster).

You probably want to start by playing against some reasonable existing AI, then switch to playing against itself.
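As a rough sketch of the self-play setup (the `new_game` and `choose_move` interfaces below are hypothetical, just to show the structure): the same search drives both colours, and every finished game yields a record for both sides, which is where the "twice as many data points" comes from.

```python
def self_play_game(new_game, choose_move):
    """Play one game where the same algorithm drives both colours.

    new_game:    returns a fresh game state (hypothetical interface: copy/play/
                 legal_moves/is_terminal/winner, as in the sketches above)
    choose_move: (state, player) -> move, e.g. an MCTS search
    Returns (history, winner); the history holds positions for both Black and
    White, so one game produces data points for both sides.
    """
    state = new_game()
    history = []
    player = "black"
    while not state.is_terminal():
        move = choose_move(state, player)
        history.append((state.copy(), player, move))
        state.play(move)
        player = "white" if player == "black" else "black"
    return history, state.winner()
```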
