
Does NVLink accelerate training with DistributedDataParallel?

Nvidia's NVLink accelerates data transfer between several GPUs on the same machine. I train large models on such a machine using PyTorch.

I see why NVLink would make model-parallel training faster, since one pass through a model will involve several GPUs.

But would it accelerate a data-parallel training process using DistributedDataParallel?

How does data-parallel training on k GPUs work?
You split your mini-batch into k parts, each part is forwarded through a replica of the model on a different GPU, and gradients are computed on each GPU. However (and this is the crucial part), before the weights can be updated, the gradients must be synchronized (all-reduced) across all GPUs so every replica applies the same update. That gradient all-reduce is inter-GPU traffic, and this is where NVLink becomes important for data-parallel training as well; a sketch of the setup follows below.
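A minimal sketch of this setup, assuming a toy model and the NCCL backend (the backend that uses NVLink between GPUs when it is available); names such as `ToyModel` and `train` are illustrative, not from the original post:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1000, 10)

    def forward(self, x):
        return self.net(x)


def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # NCCL routes the gradient all-reduce over NVLink when the GPUs are linked.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(ToyModel().to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for _ in range(10):
        # Each rank processes its own shard of the mini-batch.
        inputs = torch.randn(32, 1000, device=rank)
        targets = torch.randn(32, 10, device=rank)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        # backward() triggers the all-reduce that synchronizes gradients
        # across all GPUs -- this is the inter-GPU traffic NVLink accelerates.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

The forward and backward passes stay local to each GPU; only the gradient synchronization crosses GPUs, so the larger the model (and hence the gradient tensors), the more the NVLink bandwidth matters.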
