
How to leverage the world-size parameter for DistributedDataParallel in Pytorch example for multiple GPUs?

Original post 2017-08-14 12:24:41 8 1 python/ amazon-ec2/ gpu/ pytorch

I am running this Pytorch example on a g2.2xlarge AWS machine. When I run time python imageNet.py ImageNet2, it runs fine with the following timings:

real    3m16.253s
user    1m50.376s
sys     1m0.872s

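But when I add the world-size parameter, it gets stuck and does not execute anything. The command is as follows: time python imageNet.py --world-size 2 ImageNet2

So how do I leverage the world-size parameter in this script to use the DistributedDataParallel functionality? The world-size parameter is nothing but the number of distributed processes.

Do I spin up another similar instance for this purpose? If yes, how does the script recognize that instance? Do I need to add some parameters, such as the instance's IP?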

1 answer

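The world size parameter is the number of nodes in distributed training, so if you set world size to 2 you need to run the same command with a different rank on another node. Until that second process joins, the first one blocks while initializing the process group, which is most likely why the script appears stuck. If you just want to increase the number of GPUs on a single node, you need to change ngpus_per_node instead. Take a look at the multi-node example in this README.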
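To make the relationship between nodes, GPUs and ranks concrete, here is a minimal sketch in plain Python (it does not import torch). It assumes the convention used by the multi-node mode of the current PyTorch ImageNet example, where one process is started per GPU, the total world size is nodes times ngpus_per_node, and each process's global rank is node_rank * ngpus_per_node + local GPU index. The helper names plan_processes and node_command, the script name main.py, and the address tcp://10.0.0.1:23456 are illustrative assumptions. The flags follow the README's multi-node example as far as I know, and an older imageNet.py may accept a different set.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProcessSpec:
    node_rank: int    # value passed as --rank on that node
    local_gpu: int    # GPU index on that node
    global_rank: int  # rank of this process within the whole job
    world_size: int   # total number of processes in the job


def plan_processes(num_nodes: int, ngpus_per_node: int) -> List[ProcessSpec]:
    """Enumerate one process per GPU across all nodes (hypothetical helper)."""
    if num_nodes < 1 or ngpus_per_node < 1:
        raise ValueError("num_nodes and ngpus_per_node must be >= 1")
    world_size = num_nodes * ngpus_per_node
    return [
        ProcessSpec(node_rank, gpu, node_rank * ngpus_per_node + gpu, world_size)
        for node_rank in range(num_nodes)
        for gpu in range(ngpus_per_node)
    ]


def node_command(node_rank: int, num_nodes: int, dist_url: str,
                 data_dir: str, script: str = "main.py",
                 backend: str = "nccl") -> str:
    """Build the command to run on one node (hypothetical helper).

    The same dist_url must be used on every node; it points at node 0.
    """
    return (
        f"python {script} --dist-url {dist_url} --dist-backend {backend} "
        f"--multiprocessing-distributed "
        f"--world-size {num_nodes} --rank {node_rank} {data_dir}"
    )


if __name__ == "__main__":
    # Assumed address of node 0 and a free port; replace with real values.
    url = "tcp://10.0.0.1:23456"
    for rank in range(2):
        print(node_command(rank, 2, url, "ImageNet2"))
    for spec in plan_processes(num_nodes=2, ngpus_per_node=1):
        print(spec)
```

With two single-GPU machines, this gives one command per machine that differs only in --rank, which matches the point above: setting world size to 2 only works once the second node has been started with its own rank and the same URL.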
