
How to leverage the world-size parameter for DistributedDataParallel in the Pytorch example for multiple GPUs?

Original post: 2017-08-14 12:24:41. Tags: python, amazon-ec2, gpu, pytorch

I am running this Pytorch example on a g2.2xlarge AWS machine. When I run time python imageNet.py ImageNet2, it runs fine with the following timing:

real    3m16.253s
user    1m50.376s
sys     1m0.872s

But when I add the world-size parameter, it gets stuck and does not execute anything. The command is as follows: time python imageNet.py --world-size 2 ImageNet2

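So how do I use the world-size parameter in this script to take advantage of the DistributedDataParallel functionality? As I understand it, the world-size parameter is nothing but the number of distributed processes.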

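Do I need to start up another similar instance for this? If yes, how does the script recognize that instance? Do I need to add some parameters, such as the instance's IP?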

1 Answer

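The world size argument is the number of nodes in your distributed training, so if you set the world size to 2, you need to run the same command with a different rank on the other node. If you just want to increase the number of GPUs on a single node, you need to change ngpus_per_node instead. Take a look at the multi-node example in this Readme.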
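
To make the node/rank relationship concrete, here is a minimal sketch in plain Python (no PyTorch required). It assumes one process per GPU and that a process's global rank is node_rank * ngpus_per_node + local GPU index. The helper names (layout, launch_command), the address 10.0.0.1 and the port 23456 are hypothetical placeholders. The --dist-url and --rank flags are taken from the current pytorch/examples ImageNet script and may not exist in your copy of imageNet.py, so treat the generated command as an illustration rather than something to paste in directly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessSlot:
    node_rank: int    # which machine (0 .. num_nodes - 1)
    local_gpu: int    # GPU index on that machine
    global_rank: int  # unique id of the process across all machines


def layout(num_nodes: int, ngpus_per_node: int) -> List[ProcessSlot]:
    """Enumerate one training process per GPU across all nodes."""
    return [
        ProcessSlot(node, gpu, node * ngpus_per_node + gpu)
        for node in range(num_nodes)
        for gpu in range(ngpus_per_node)
    ]


def launch_command(node_rank: int, num_nodes: int, master_addr: str,
                   port: int = 23456, script: str = "imageNet.py",
                   data: str = "ImageNet2") -> str:
    """Build the command to run on one node.

    Assumption: the script accepts --dist-url and --rank, as the current
    pytorch/examples ImageNet script does. master_addr is the IP of node 0.
    """
    return (
        f"python {script} --dist-url tcp://{master_addr}:{port} "
        f"--world-size {num_nodes} --rank {node_rank} {data}"
    )


if __name__ == "__main__":
    # Two nodes with one GPU each (hypothetical address for node 0).
    for slot in layout(num_nodes=2, ngpus_per_node=1):
        print(slot)
    for node_rank in range(2):
        print(launch_command(node_rank, num_nodes=2, master_addr="10.0.0.1"))
```

This also suggests why the original command appears to hang: with a world size of 2, process-group initialization waits until a second process with the other rank has joined, and on a single instance that second process never starts.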

