How to leverage the world-size parameter for DistributedDataParallel in Pytorch example for multiple GPUs?
I am running this Pytorch example on a g2.2xlarge AWS machine. When I run time python imageNet.py ImageNet2, it runs well with the following timing:
real 3m16.253s
user 1m50.376s
sys 1m0.872s
However, when I add the world-size parameter, it gets stuck and does not execute anything. The command is as follows:

time python imageNet.py --world-size 2 ImageNet2
So, how do I leverage the DistributedDataParallel functionality with the world-size parameter in this script? The world-size parameter is nothing but the number of distributed processes.

Do I spin up another similar instance for this purpose? If yes, then how does the script recognize that instance? Do I need to add some parameters like the instance's IP or something?
The world-size argument is the number of nodes in your distributed training, so if you set the world size to 2 you need to run the same command with a different rank on the other node, as sketched below.
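For illustration, a two-node launch might look like the following, assuming imageNet.py exposes the same distributed flags (--dist-url, --dist-backend, --rank) as the official PyTorch ImageNet example; the IP address and port here are placeholders for node 0's address:

# On node 0 (the machine whose address both processes connect to):
time python imageNet.py --dist-url tcp://192.168.1.1:23456 --dist-backend gloo --world-size 2 --rank 0 ImageNet2

# On node 1 (same command, different rank):
time python imageNet.py --dist-url tcp://192.168.1.1:23456 --dist-backend gloo --world-size 2 --rank 1 ImageNet2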
If you just want to increase the number of GPUs on a single node, you need to change ngpus_per_node instead. Take a look at the multiple node example in this Readme.
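This also explains why your original command got stuck: with --world-size 2 the script enters distributed mode, and the rendezvous inside torch.distributed.init_process_group blocks until all world_size processes have joined. A minimal sketch of that behaviour (the backend, address, and port below are illustrative, not the exact values from the example):

import torch.distributed as dist

# Blocks until world_size processes have connected to the rendezvous
# address. Launched alone with world_size=2, this call waits forever,
# which is the hang observed in the question.
dist.init_process_group(backend="gloo",
                        init_method="tcp://192.168.1.1:23456",
                        world_size=2,
                        rank=0)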