Does NVLink accelerate training with DistributedDataParallel?
Nvidia's NVLink accelerates data transfer between several GPUs on the same machine. I train large models on such a machine using PyTorch.
I see why NVLink would make model-parallel training faster, since one pass through a model will involve several GPUs.
But would it accelerate a data-parallel training process using DistributedDataParallel?
How does data-parallel training on k GPUs work?
You split your mini-batch into k parts, each part is forwarded on a different GPU, and gradients are estimated on each GPU. However (and this is super crucial), updating the weights must be synchronized between all GPUs: in DistributedDataParallel this synchronization is an all-reduce over all of the model's gradients, so the inter-GPU traffic grows with model size. This is where NVLink becomes important for data-parallel training as well.