
How to leverage the world-size parameter for DistributedDataParallel in Pytorch example for multiple GPUs?

I am running this Pytorch example on a g2.2xlarge AWS machine. So, when I run time python imageNet.py ImageNet2, it runs well with the following timing:

real    3m16.253s
user    1m50.376s
sys 1m0.872s

However, when I add the world-size parameter, it gets stuck and does not execute anything. The command is as follows: time python imageNet.py --world-size 2 ImageNet2

So, how do I leverage the DistributedDataParallel functionality with the world-size parameter in this script? The world-size parameter is nothing but the number of distributed processes.

Do I spin up another similar instance for this purpose? If yes, then how does the script recognize that instance? Do I need to add some parameters like the instance's IP or something?
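For context, my understanding is that the distributed setup inside the script boils down to a call like the one below. This is only a simplified sketch using the torch.distributed API directly; the exact flag names and defaults in imageNet.py may differ, and the address used for --dist-url here is just a placeholder.

    import argparse

    import torch.distributed as dist
    import torch.nn as nn

    parser = argparse.ArgumentParser()
    parser.add_argument('--world-size', type=int, default=1)  # total number of distributed processes
    parser.add_argument('--rank', type=int, default=0)        # id of this process, 0 .. world_size-1
    parser.add_argument('--dist-url', default='tcp://127.0.0.1:23456')  # address of the rank-0 process
    args = parser.parse_args()

    # Every participating process calls this. With a tcp:// init method the call
    # blocks until all world_size processes have joined the group, which presumably
    # explains why my single process started with --world-size 2 just sits there.
    dist.init_process_group(backend='gloo',
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)

    # Once the group exists, the model is wrapped so that gradients are
    # averaged across all processes during backward().
    model = nn.parallel.DistributedDataParallel(nn.Linear(10, 10))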

The world-size argument is the number of nodes in your distributed training, so if you set the world size to 2 you need to run the same command with a different rank on the other node. If you just want to increase the number of GPUs on a single node, you need to change ngpus_per_node instead. Take a look at the multiple-node example in this Readme.
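Concretely, and assuming your copy of the script exposes --rank and --dist-url flags like the ImageNet example in that Readme (older versions of the script may use different flags), a two-node run would look roughly like this, with NODE0_IP standing in for the address of the first instance and the chosen port reachable between the two machines:

    # on the first instance (its address is used as the rendezvous point)
    python imageNet.py --dist-url tcp://NODE0_IP:23456 --world-size 2 --rank 0 ImageNet2

    # on the second, similarly configured instance
    python imageNet.py --dist-url tcp://NODE0_IP:23456 --world-size 2 --rank 1 ImageNet2

Both processes block in the initialization step until the other one has joined, so neither command will make progress if the other node is not started or cannot reach NODE0_IP on that port.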
