
How do I effectively parallelize AlphaZero on the GPU?

I'm implementing a version of AlphaZero (AlphaGo's most recent incarnation) to be applied to some other domain.

The crux of the algorithm is a Monte Carlo Tree Search of the state space (CPU) interleaved with 'intuition' (probabilities) from a neural network in eval mode (GPU). The MCTS result is then used to train the neural network.

I already parallelized the CPU execution by launching multiple processes which each build up their own tree. This is effective and has now led to a GPU bottleneck (nvidia-smi showing the GPU at 100% all the time).

I have devised two strategies to parallelize GPU evaluations, but both of them have problems.

  • Each process evaluates the network only on batches from its own tree. In my initial naive implementation, this meant a batch size of 1. However, by refactoring some code and adding a 'virtual loss' to discourage (but not completely block) the same node from being picked twice, we can get larger batches of size 1-4 (a minimal sketch of this is included after this list). The problem here is that we cannot allow large delays before evaluating the batch or accuracy suffers, so a small batch size is key here.

  • Send the batches to a central "neural network worker" thread which combines and evaluates them. This could be done in a large batch of 32 or more, so the GPU could be used very efficiently. The problem here is that the tree workers send CUDA tensors 'round-trip', which is not supported by PyTorch. It is supported if I clone them first, but all that constant copying makes this approach slower than the first one.
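
For reference, here is a minimal sketch of how the first approach batches leaf evaluations under a virtual loss (Python/PyTorch). The toy Node class, the stand-in nn.Linear network, and the random state tensors are placeholders for my real tree, policy/value network and state encoding; the actual PUCT selection and expansion are domain-specific.

    import math
    import torch
    import torch.nn as nn

    class Node:
        def __init__(self, prior):
            self.prior = prior
            self.visits = 0
            self.value_sum = 0.0
            self.virtual_loss = 0
            self.children = {}                     # action -> Node

        def ucb(self, parent_visits, c_puct=1.5):
            n = self.visits + self.virtual_loss    # virtual loss inflates the visit count...
            q = (self.value_sum - self.virtual_loss) / n if n else 0.0   # ...and lowers Q
            return q + c_puct * self.prior * math.sqrt(parent_visits) / (1 + n)

    def collect_leaves(root, k=4):
        """Select up to k leaves, adding virtual loss along each path so the next
        selection is discouraged (but not completely blocked) from repeating it."""
        leaves = []
        for _ in range(k):
            node, path = root, [root]
            while node.children:
                parent_visits = node.visits + node.virtual_loss
                node = max(node.children.values(), key=lambda c: c.ucb(parent_visits))
                path.append(node)
            for n in path:
                n.virtual_loss += 1
            leaves.append((node, path))
        return leaves

    # Toy demo: a root with four children, a stand-in network, random stand-in states.
    root = Node(prior=1.0)
    root.children = {a: Node(prior=0.25) for a in range(4)}
    net = nn.Linear(8, 1)
    leaves = collect_leaves(root, k=4)
    states = torch.randn(len(leaves), 8)
    with torch.no_grad():
        values = net(states)                       # one forward pass for the whole mini-batch
    for (leaf, path), v in zip(leaves, values):
        for n in path:                             # undo virtual loss and back up the real value
            n.virtual_loss -= 1
            n.visits += 1
            n.value_sum += v.item()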

I was thinking maybe a clever batching scheme that I'm not seeing could make the first approach work. Using multiple GPUs could speed up the first approach too, but the kind of parallelism I want is not natively supported by PyTorch. Maybe keeping all tensors in the NN worker and only sending ids around could improve the second approach, but the difficulty here is how to synchronize effectively to get a large batch without making the CPU threads wait too long.
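
Concretely, the kind of central worker I have in mind looks roughly like the sketch below: tree workers send a worker id, a request id and a small CPU tensor (so there is no CUDA round-trip), and the worker blocks for the first request, gathers more for a few milliseconds, runs one batched forward pass, and routes each result back by id. The nn.Linear stand-in, the queue layout and the timing constants are just illustrative assumptions, not something from the papers.

    import queue
    import torch
    import torch.multiprocessing as mp

    MAX_BATCH, GATHER_TIMEOUT_S = 32, 0.005

    def nn_worker(request_q, response_qs):
        net = torch.nn.Linear(8, 1)                    # stand-in for the real network
        device = "cuda" if torch.cuda.is_available() else "cpu"
        net.to(device).eval()
        while True:
            wid, rid, state = request_q.get()          # block until the first request arrives
            ids, states = [(wid, rid)], [state]
            while len(states) < MAX_BATCH:             # then gather more, but never wait long
                try:
                    wid, rid, state = request_q.get(timeout=GATHER_TIMEOUT_S)
                except queue.Empty:
                    break
                ids.append((wid, rid))
                states.append(state)
            with torch.no_grad():
                out = net(torch.stack(states).to(device)).cpu()
            for (wid, rid), value in zip(ids, out):
                response_qs[wid].put((rid, value.item()))   # route each result back by id

    def tree_worker(wid, request_q, response_q):
        for rid in range(3):                           # pretend to run a few MCTS expansions
            request_q.put((wid, rid, torch.randn(8)))  # CPU tensor + ids, no CUDA round-trip
            got_rid, value = response_q.get()
            print(f"worker {wid} request {got_rid}: value {value:.3f}")

    if __name__ == "__main__":
        mp.set_start_method("spawn", force=True)
        request_q = mp.Queue()
        response_qs = [mp.Queue() for _ in range(2)]
        mp.Process(target=nn_worker, args=(request_q, response_qs), daemon=True).start()
        workers = [mp.Process(target=tree_worker, args=(w, request_q, response_qs[w]))
                   for w in range(2)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()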

I found next to no information on how AlphaZero or AlphaGo Zero were parallelized in their respective papers. I was able to find limited information online, however, which led me to improve the first approach.

I would be grateful for any advice on this, particularly if there's some point or approach I missed.

Take TensorFlow Serving as an example: the prediction service could run in a different process, running a service that receives requests from the workers (each worker runs an MCTS process and sends prediction requests to the prediction service). We can keep a dict from the socket address to the socket itself.

The prediction service reads each query's body and its header (which is different for each query), and puts the headers in a queue. The prediction runs once it has waited at most ~100 ms or the current batch has grown bigger than the target batch size. After the GPU gives the results, we loop over them, and since their order is the same as the headers in the queue, we can send back the responses via the socket corresponding to each header (looked up from the dict we kept above).

As each query comes with a different header, you cannot mix up the request, the response, and the socket. This way you can run one prediction service on a GPU card while running multiple workers, keeping the batch size big enough to get higher throughput.
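
Here is a rough sketch of that batching loop in PyTorch (not TensorFlow Serving's actual API). It assumes requests arrive on a thread-safe queue as (header, state) pairs, net is the model being served, and sockets is a dict that lets each header be resolved to the client socket it came from (playing the role of the address-to-socket dict kept above); all names and constants are illustrative.

    import queue
    import time
    import torch

    MAX_WAIT_S, MAX_BATCH = 0.1, 32        # run after at most ~100 ms, or once the batch is full

    def serve_forever(request_q, sockets, net, device="cpu"):
        net.to(device).eval()
        while True:
            headers, states = [], []
            deadline = None
            while len(states) < MAX_BATCH:
                timeout = None if deadline is None else max(0.0, deadline - time.monotonic())
                try:
                    header, state = request_q.get(timeout=timeout)   # block only for the first one
                except queue.Empty:
                    break                                            # time is up: run what we have
                if deadline is None:
                    deadline = time.monotonic() + MAX_WAIT_S         # start the clock on arrival
                headers.append(header)
                states.append(state)
            with torch.no_grad():
                outputs = net(torch.stack(states).to(device)).cpu()
            # The results come back in the same order as the queued headers, so each
            # response can be sent on the socket looked up from the header -> socket mapping.
            for header, out in zip(headers, outputs):
                sockets[header].sendall(out.numpy().tobytes())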

I also found a batching mechanism in this repo: https://github.com/richemslie/galvanise_zero
