How to leverage the world-size parameter for DistributedDataParallel in Pytorch example for multiple GPUs?
I am running this Pytorch example on a g2.2xlarge AWS machine. When I run time python imageNet.py ImageNet2, it runs well with the following timing:
real 3m16.253s
user 1m50.376s
sys 1m0.872s
However, when I add the world-size parameter, it gets stuck and does not execute anything. The command is as follows:

time python imageNet.py --world-size 2 ImageNet2
So, how do I leverage the DistributedDataParallel functionality with the world-size parameter in this script? The world-size parameter is nothing but the number of distributed processes.

Do I spin up another similar instance for this purpose? If yes, then how does the script recognize that instance? Do I need to add some parameters like the instance's IP or something?
The world-size argument is the number of nodes in your distributed training, so if you set the world size to 2 you need to run the same command with a different rank on the other node, as sketched below.
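For illustration, a two-node launch might look like the following, assuming imageNet.py exposes the same distributed flags (--dist-url, --dist-backend, --rank) as the official PyTorch ImageNet example; the IP address and port here are placeholders for node 0's address:

# On node 0 (the machine whose address both processes connect to):
time python imageNet.py --dist-url tcp://192.168.1.1:23456 --dist-backend gloo --world-size 2 --rank 0 ImageNet2

# On node 1 (same command, different rank):
time python imageNet.py --dist-url tcp://192.168.1.1:23456 --dist-backend gloo --world-size 2 --rank 1 ImageNet2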
If you just want to increase the number of GPUs on a single node, you need to change ngpus_per_node instead. Take a look at the multiple node example in this Readme.
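This also explains why your original command got stuck: with --world-size 2 the script enters distributed mode, and the rendezvous inside torch.distributed.init_process_group blocks until all world_size processes have joined. A minimal sketch of that behaviour (the backend, address, and port below are illustrative, not the exact values from the example):

import torch.distributed as dist

# Blocks until world_size processes have connected to the rendezvous
# address. Launched alone with world_size=2, this call waits forever,
# which is the hang observed in the question.
dist.init_process_group(backend="gloo",
                        init_method="tcp://192.168.1.1:23456",
                        world_size=2,
                        rank=0)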