
Why does Monte Carlo Tree Search reset the tree?

I had a small but potentially stupid question about Monte Carlo Tree Search. I understand most of it, but I have been looking at some implementations and noticed that after MCTS is run for a given state and a best move is returned, the tree is thrown away. So for the next move, we have to run MCTS from scratch on this new state to get the next best position.

I was just wondering why we don't retain some of the information from the old tree. It seems like there is valuable information about the states in the old tree, especially given that the best move is one where the MCTS has explored most. Is there any particular reason we can't use this old information in some useful way?

Some implementations do indeed retain the information.

For example, the AlphaGo Zero paper says:

The search tree is reused at subsequent time-steps: the child node corresponding to the played action becomes the new root node; the subtree below this child is retained along with all its statistics, while the remainder of the tree is discarded.
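To make the reuse step concrete, here is a minimal sketch in Python (the names `Node` and `advance_root` are illustrative, not taken from the AlphaGo Zero code):

    class Node:
        """One MCTS node with the usual search statistics."""
        def __init__(self, parent=None):
            self.parent = parent
            self.children = {}      # action -> Node
            self.visit_count = 0
            self.value_sum = 0.0

    def advance_root(root, played_action):
        """Re-root the tree at the child of the action that was actually played.

        The subtree below that child, with all its statistics, is kept;
        the rest of the old tree becomes unreachable and is discarded.
        """
        child = root.children.get(played_action)
        if child is None:
            # The played action was never expanded: fall back to a fresh tree.
            return Node()
        child.parent = None   # detach so the old tree can be garbage collected
        return child

The next search then starts from the returned node instead of from an empty root.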

Well, the reason may be the following.

Rollouts are truncated value estimates: any contribution beyond the maximum rollout length is discarded.

Assume that the maximum rollout depth is N.

Consider an environment where the average reward is != 0 (say, > 0).

After an action is taken and an observation is obtained, a child node of the tree can be selected as the new root.

Now the maximum length of the branches, and of the rollouts that participated in the evaluation of a node's value, is N-1, since the old root node has been discarded.

However, the new simulations will still have length N, and they have to be combined with the recycled simulations of length N-1.

The longer simulations will have a biased value, because the average reward is != 0.

This means that nodes evaluated with a mix of simulation lengths will have a bias that depends on the ratio of simulations of different lengths.
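A small numerical illustration of this, under the assumption of undiscounted rollouts that simply sum a constant average per-step reward (the numbers are made up):

    # Hypothetical setup: average per-step reward r > 0, returns are the
    # undiscounted sum of rewards along the rollout.
    r = 1.0   # average per-step reward
    N = 20    # maximum rollout depth

    recycled_estimate = (N - 1) * r   # backed up through the reused subtree (length N-1)
    fresh_estimate = N * r            # from a new simulation of full length N

    print(recycled_estimate)  # 19.0
    print(fresh_estimate)     # 20.0
    # Averaging these at the same node mixes systematically different
    # quantities, so the node value depends on the ratio of old to new
    # simulations, not only on the quality of the moves below it.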

Another reason to avoid recycling old, shorter simulations is the bias they induce on the sampling. Imagine a T-maze where, on the left, there is a maximum reward of R/2 at depth d, while on the right there is a maximum reward of R at depth d+1. With a recycled tree, all the paths to the left that were able to reach the R/2 reward at depth d during the first search will be favoured during the second search, while paths to the right will be sampled less often, so there is a higher chance of never reaching the reward R. Starting from an empty tree gives the same probability to both sides of the maze.

AlphaGo Zero (see Peter de Rivaz's answer) actually does not use rollouts but uses a value approximation generated by a deep network. These values are not truncated estimates, so AlphaGo Zero is not affected by this branch-length bias.

AlphaGo, the predecessor of AlphaGo Zero, combined rollouts with the value approximation and also reused the tree, but the new version no longer uses rollouts, perhaps for this reason. Also, both AlphaGo Zero and AlphaGo do not select the final move by the value of the action but by the number of times it was selected during search. This count may be less affected by the length bias, at least in the case where the average reward is negative.
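For the last point, a minimal sketch of choosing the move from visit counts rather than from mean action values, reusing the hypothetical Node class from the earlier sketch (an illustration, not AlphaGo's actual code):

    def select_move_by_count(root):
        """Pick the action whose child was visited most during the search."""
        return max(root.children.items(),
                   key=lambda item: item[1].visit_count)[0]

    def select_move_by_value(root):
        """Alternative: pick by mean backed-up value, the quantity that the
        answer argues is more exposed to the rollout-length bias."""
        return max(root.children.items(),
                   key=lambda item: item[1].value_sum / max(1, item[1].visit_count))[0]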

Hope this is clear.

