What is the concept of “Last Good Reply” and “Rapid Action Value Estimation” in Monte Carlo Simulation?

I have developed a simple Hex player based on Monte Carlo Tree Search for the game of Hex. Now I want to extend the Hex player using RAVE (Rapid Action Value Estimation) and LGP (Last Good Reply). The articles are here and here.
I was wondering if anyone here has used any of these methods to improve the tree search performance and could help me understand them?
I also want to know why these algorithms are called AMAF (All Moves As First) heuristics.

In Monte Carlo tree search for games, which draws on ideas from reinforcement learning, there are two types of back-propagation: AMAF and UCT.

The UCT method back-propagates only along the path it traversed during the selection phase: only the nodes visited during selection are updated, and each is updated exactly at its own state. In AMAF, by contrast, all the moves encountered during the roll-out phase are also recorded, and in the back-propagation phase they are credited, together with the nodes on the selection path, without considering the states in which they were actually played. This is why such methods are called "All Moves As First" heuristics: every move that occurs later in the simulation is credited as if it had been the first move played from that node.
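
Below is a minimal sketch of the difference in Python, using a hypothetical `Node` structure with separate UCT and AMAF counters (none of these names come from the question's code). UCT updates only the nodes actually visited during selection, while AMAF additionally credits every child move that appears anywhere later in the simulation.

```python
class Node:
    """Hypothetical tree node: children keyed by move, separate UCT / AMAF statistics."""
    def __init__(self):
        self.children = {}       # move -> Node
        self.visits = 0          # UCT visit count
        self.wins = 0.0          # UCT win total
        self.amaf_visits = 0     # AMAF visit count
        self.amaf_wins = 0.0     # AMAF win total

def backpropagate(path, playout_moves, reward):
    """path: list of (node, move_taken) pairs from the selection phase.
    playout_moves: iterable of all moves played during the roll-out.
    reward: 1.0 if the playout was won by the player to move at the last node, else 0.0."""
    later_moves = set(playout_moves)
    for node, move in reversed(path):
        # UCT update: only the node that was really visited.
        node.visits += 1
        node.wins += reward
        # AMAF update: credit every sibling move that occurred anywhere later,
        # ignoring the state in which it was eventually played.
        # (A real two-player implementation would credit only moves of the
        #  player to move at this node; colours are omitted here for brevity.)
        for m, child in node.children.items():
            if m in later_moves:
                child.amaf_visits += 1
                child.amaf_wins += reward
        later_moves.add(move)       # selection moves also count "as first" higher up
        reward = 1.0 - reward       # flip perspective one ply up
```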

UCT gives a very precise, local value for a (state, action) pair, but it is too slow to converge. The AMAF heuristic, on the other hand, converges very fast, but its (state, action) values are too general to be reliable.

We can get the benefit of both strategies by mixing the two values with a coefficient that decreases as more simulations accumulate, like this:

a * UCT + (1 - a) * AMAF

This is the RAVE (Rapid Action Value Estimation) heuristic.
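
As a minimal sketch, and assuming the counters from the node structure above, the coefficient can simply be driven by the child's real visit count; `rave_equiv` and `c_uct` are illustrative parameters, not values from the answer:

```python
import math

def rave_score(parent, child, rave_equiv=300.0, c_uct=1.0):
    """Blend the precise UCT value with the fast AMAF value.
    The weight a grows towards 1 as the child accumulates real visits,
    so AMAF dominates early and UCT takes over later."""
    uct_q = child.wins / child.visits if child.visits else 0.0
    amaf_q = child.amaf_wins / child.amaf_visits if child.amaf_visits else 0.0
    a = child.visits / (child.visits + rave_equiv)
    exploration = c_uct * math.sqrt(math.log(parent.visits + 1) / (child.visits + 1))
    return a * uct_q + (1.0 - a) * amaf_q + exploration

def select_child(parent):
    """Pick the (move, child) pair with the highest blended score."""
    return max(parent.children.items(), key=lambda mc: rave_score(parent, mc[1]))
```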

Last-Good-Reply is AMAF-based, but it can benefit from RAVE. Its general idea is that during the playout phase, when we play moves in reply to the opponent's moves, if those replies turn out to be successful against the opponent, we can store them and reuse them as replies in later playouts.
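
A minimal sketch of a last-good-reply table, assuming moves are hashable (board cells in Hex) and that a finished playout is available as an ordered list of (player, move) pairs together with the winner; all names here are illustrative:

```python
import random

# last_good_reply[player][opponent_move] = the reply that last succeeded against it
last_good_reply = {1: {}, 2: {}}

def record_replies(playout_moves, winner):
    """playout_moves: list of (player, move) pairs in the order they were played."""
    for (prev_player, prev_move), (cur_player, cur_move) in zip(playout_moves, playout_moves[1:]):
        if cur_player == winner:
            last_good_reply[cur_player][prev_move] = cur_move   # remember a winning reply
        else:
            # optional "forgetting" (the LGRF variant): drop a reply that just failed
            last_good_reply[cur_player].pop(prev_move, None)

def playout_move(player, opponent_last_move, legal_moves):
    """During roll-outs, prefer the stored reply to the opponent's last move; otherwise play randomly."""
    reply = last_good_reply[player].get(opponent_last_move)
    if reply is not None and reply in legal_moves:
        return reply
    return random.choice(list(legal_moves))
```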
