
Is PyTorch DistributedDataParallel with different GPU speeds syncing weights?

In the following scenario, there are 2 GPUs with significantly different speeds: GPU0 is faster than GPU1 by around 40%. I want to train the model for 100k steps, which, if the speeds were the same, would ordinarily be reached in the equivalent of 50k steps per GPU.

However, since the GPUs are of different speeds, when GPU0 hits 50k steps, GPU1 reaches only 30k steps. Effectively, the model has been trained for 80k steps.

In practice, will PyTorch's DistributedDataParallel work with GPUs of different speeds? Currently, the script [A] runs such that GPU0 proceeds at its original speed without waiting for GPU1, so I was wondering how any syncing would work. I printed the parameters of the model on each GPU at the same step, and they are indeed significantly different. If syncing does happen, where does it occur?

In the original source code [B] for DDP, it does seem that syncing happens before each forward pass of the model. But if that is the case, I don't know why the sum of the parameters on each GPU is off by around 1-2% of the total value.

The function to get the parameter sum is simply this:

import torch

def get_params_sum(net):
    # Accumulate the sum of every element of every parameter into one scalar tensor.
    total = 0
    for param in net.parameters():
        total = total + torch.sum(param)

    return total
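One way to compare the sums across ranks directly, rather than eyeballing printed values, is a small helper along the following lines. This is only a sketch: it assumes the default process group has already been initialized by the training script, and compare_param_sums is an illustrative name, not something from the script in [A].

import torch
import torch.distributed as dist

def compare_param_sums(net):
    # Illustrative helper: gather every rank's parameter sum and print them
    # side by side from rank 0. Assumes dist.init_process_group() has run.
    local_sum = get_params_sum(net).detach().reshape(1)
    sums = [torch.zeros_like(local_sum) for _ in range(dist.get_world_size())]
    dist.all_gather(sums, local_sum)
    if dist.get_rank() == 0:
        print("per-rank parameter sums:", [s.item() for s in sums])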

Is there a way to automatically let GPU0 take over some of GPU1's "leftover" training once it is done?

[A] A running script can be found here: https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-distributed.py

[B] https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py#L707

Since DDP fully syncs gradients at every step, the faster GPU0 should always wait for the slower GPU1.

The sync occurs at the backward step, where the gradients are all-reduced.
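To make that concrete, here is a minimal sketch of the usual DDP training loop; a toy linear model and random batches stand in for the real model and data, and the "nccl" backend with environment-variable rendezvous is assumed, as in the linked script [A]. The allreduce of gradient buckets happens inside loss.backward(), so the fast rank cannot finish a step until the slow rank has also run its backward for that step.

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Sketch only: toy model and random data; MASTER_ADDR/MASTER_PORT assumed set.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    model = DDP(nn.Linear(10, 2).to(device), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randint(0, 2, (32,), device=device)
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()   # gradient buckets are all-reduced here; each rank
                          # blocks until the other ranks contribute theirs
        optimizer.step()  # every rank applies the same averaged gradients

    dist.destroy_process_group()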

DDP is not designed to run in a heterogeneous environment. You may consider dividing the input in proportion to the compute power of the two GPUs, and letting DDP handle the resulting uneven inputs, as sketched below.
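A sketch of that approach, assuming each rank builds a local_loader whose length is proportional to its GPU's speed (roughly 40% more batches on GPU0 here), could use DDP's join() context manager, which exists precisely to handle ranks that run out of inputs at different times:

# Sketch: `model` is the DDP-wrapped module and `local_loader` is a
# hypothetical per-rank dataloader sized to that GPU's speed.
with model.join():
    for inputs, targets in local_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()   # ranks that have already exhausted their loader
                          # shadow this allreduce instead of blocking it
        optimizer.step()

The per-rank loader sizing replaces the equal DistributedSampler split used in the linked script; that sizing step is an assumption here, not something DDP does automatically.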
