
How is Monte Carlo Tree Search implemented in practice

I understand, to a certain degree, how the algorithm works. What I don't fully understand is how the algorithm is actually implemented in practice.

I'm interested in understanding what the optimal approaches would be for a fairly complex game (maybe chess). i.e. a recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)?

-- What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)

-- If each branch results in a completely new game being played (this could reach the millions), how do we keep the overall system stable? And how can we reuse branches already played?

recursive approach? async? concurrent? parallel? distributed? data structures and/or database(s)?

  • In MCTS, there's not much point in a recursive implementation (which is common in other tree search algorithms, such as the minimax-based ones), because you always step through a game sequentially from the current game state (the root node) to the game states you choose to evaluate (terminal game states, unless you go with a non-standard implementation that uses a depth limit on the play-out phase and a heuristic evaluation function). The much more obvious implementation using while loops is just fine; see the sketch after this list.
  • If it's your first time implementing the algorithm, I'd recommend just going for a single-threaded implementation first. It is a relatively easy algorithm to parallelize though; there are multiple papers on that. You can simply run multiple simulations (where simulation = selection + expansion + play-out + backpropagation) in parallel. You can try to make sure everything gets updated cleanly during backpropagation, but you can also simply decide not to use any locks / blocking etc. at all; there's already enough randomness in all the simulations anyway, so if you lose information from a couple of simulations here and there due to naively implemented parallelization, it really doesn't hurt too much. A root-parallel sketch is also included after this list.
  • As for data structures, unlike algorithms such as minimax, you actually do need to explicitly build a tree and store it in memory (it is built up gradually as the algorithm runs). So you'll want a general tree data structure with Nodes that hold a list of successor / child Nodes, and also a pointer back to the parent Node (required for backpropagation of simulation outcomes).
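
Here is a minimal single-threaded sketch of the tree data structure and the iterative (while-loop based) main loop. It is not a full engine: GameState and its methods (legal_moves, apply, is_terminal, result) are hypothetical placeholders for whatever game model you plug in, and apply() is assumed to return a new state rather than mutate in place.

    import math
    import random

    class Node:
        def __init__(self, state, parent=None, move=None):
            self.state = state
            self.parent = parent                  # pointer back to the parent, needed for backpropagation
            self.move = move                      # move that led here from the parent
            self.children = []                    # successor nodes, added lazily during expansion
            self.untried_moves = list(state.legal_moves())
            self.visits = 0
            self.total_reward = 0.0

        def ucb1(self, c=1.41):
            # UCB1 score used in the selection phase
            return (self.total_reward / self.visits
                    + c * math.sqrt(math.log(self.parent.visits) / self.visits))

    def mcts(root_state, iterations=10_000):
        root = Node(root_state)
        for _ in range(iterations):
            node = root

            # 1. Selection: walk down while the node is fully expanded and has children
            while not node.untried_moves and node.children:
                node = max(node.children, key=lambda n: n.ucb1())

            # 2. Expansion: create exactly one new node per iteration
            if node.untried_moves:
                move = node.untried_moves.pop(random.randrange(len(node.untried_moves)))
                child = Node(node.state.apply(move), parent=node, move=move)
                node.children.append(child)
                node = child

            # 3. Play-out: random moves until a terminal state; no nodes are stored here
            state = node.state
            while not state.is_terminal():
                state = state.apply(random.choice(list(state.legal_moves())))

            # 4. Backpropagation: follow the parent pointers back to the root
            reward = state.result()               # assumed: outcome from the root player's point of view;
            while node is not None:               # a two-player engine would flip the sign at every ply
                node.visits += 1
                node.total_reward += reward
                node = node.parent
        return root

After the loop you would typically play the root child with the most visits, e.g. max(mcts(state).children, key=lambda n: n.visits).move.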
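
And one possible parallelization, building on the Node / mcts() sketch above: root parallelization, where each worker process runs its own independent single-threaded search from the same root state and the root children's visit counts are summed afterwards. This assumes the functions live at module top level and that root_state and the move objects are picklable; it is only one of several parallelization schemes described in the literature.

    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def worker_visit_counts(root_state, iterations):
        # run an independent single-threaded search and report the root statistics
        root = mcts(root_state, iterations)
        return {child.move: child.visits for child in root.children}

    def parallel_mcts(root_state, iterations=10_000, workers=4):
        with ProcessPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(worker_visit_counts, root_state, iterations)
                       for _ in range(workers)]
            merged = Counter()
            for f in futures:
                merged.update(f.result())
        # pick the move with the highest combined visit count across all workers
        return max(merged, key=merged.get)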

What type of limits would we expect to see on a single machine? (could we run concurrently across many cores... gpu maybe?)

Running across many cores can be done, yes (see the point about parallelization above). I don't see any part of the algorithm being particularly well suited to GPU implementations (there are no large matrix multiplications or anything like that), so a GPU is unlikely to be interesting.

If each branch results in a completely new game being played (this could reach the millions), how do we keep the overall system stable? And how can we reuse branches already played?

In the most commonly described implementation, the algorithm creates only one new node to store in memory per iteration/simulation, in the expansion phase (the first node encountered after the selection phase). All other game states generated in the play-out phase of the same simulation do not get any nodes stored in memory at all. This keeps memory usage in check; it means your tree only grows relatively slowly (at a rate of one node per simulation). It does mean you get slightly less re-use of previously simulated branches, because you don't store everything you see in memory. You can choose to implement a different strategy for the expansion phase (for example, create new nodes for all game states generated in the play-out phase), but you'll have to carefully monitor memory usage if you do. A rough estimate of the growth under the standard strategy is sketched below.
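
A back-of-envelope check of why the standard "one new node per simulation" strategy stays stable; the per-node size here is a made-up assumption, so measure your own Node objects in practice.

    # tree memory growth under the standard expansion strategy
    simulations     = 1_000_000
    bytes_per_node  = 500        # assumed: stored state + child list + counters
    tree_size_bytes = simulations * bytes_per_node
    print(f"~{tree_size_bytes / 1e9:.1f} GB after {simulations:,} simulations")
    # => ~0.5 GB: the tree grows linearly, one node per simulation,
    #    while play-out states are discarded and never stored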
