
Using Multiple GPUs outside of training in PyTorch

I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory on a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such a calculation across two GPUs. The code follows:

def ac_distance(layer):
    # Accumulate the pairwise distance between every pair of kernels
    # in the layer's weight tensor (including each kernel with itself).
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total

Where layer is an instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all the calculations in ac_distance are done using only one GPU, even though training uses both.
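(For illustration only: a distance of the kind described above, a sum of the differences between two kernels, might look something like the sketch below; the actual implementation isn't shown in the question.)

def distance(p, q):
    # Hypothetical sketch, not the asker's code: sum of the elementwise
    # (absolute) differences between two kernels.
    return (p - q).abs().sum()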

Parallelism while training neural networks can be achieved in two ways.

  1. Data Parallelism - Split a large batch into two halves and run the same set of operations on each half, one half per GPU
  2. Model Parallelism - Split the computations themselves and run them on different GPUs

As you have asked in the question, you would like to split the calculation itself, which falls into the second category. There are no out-of-the-box ways to achieve model parallelism. PyTorch provides primitives for parallel processing through the torch.distributed package. This tutorial goes through the details of the package comprehensively, and you can cook up an approach to achieve the model parallelism you need.
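For this specific pairwise computation, one simple way to cook up model parallelism without a full torch.distributed setup is to split the outer loop by hand, placing half of it (and the corresponding part of the autograd graph) on each GPU. The sketch below is only an illustration under stated assumptions: two visible GPUs, and a placeholder distance that sums the differences between two kernels as described in the question.

import torch
import torch.nn as nn

def distance(p, q):
    # Placeholder for the question's distance: sum of the differences between kernels.
    return (p - q).abs().sum()

def ac_distance_two_gpus(layer):
    # Copy the weights to both devices and evaluate half of the outer loop
    # on each one. Nothing is detached, so gradients still flow back to
    # layer.weight through the .to() copies.
    w0 = layer.weight.to("cuda:0")
    w1 = layer.weight.to("cuda:1")
    half = w0.shape[0] // 2

    total0 = torch.zeros((), device="cuda:0")
    total1 = torch.zeros((), device="cuda:1")
    for p in w0[:half]:
        for q in w0:
            total0 = total0 + distance(p, q)
    for p in w1[half:]:
        for q in w1:
            total1 = total1 + distance(p, q)

    # Combine the partial sums on one device.
    return total0 + total1.to("cuda:0")

layer = nn.Conv2d(16, 32, 3).cuda(0)   # toy layer; running this requires two visible GPUs
print(ac_distance_two_gpus(layer))

The saving comes from each GPU holding only about half of the intermediate graph; the weight copy itself is duplicated on both devices.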

However, model parallelism can be very complex to achieve. The usual approach is data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods you run the same model on two different GPUs, and each large batch is split into two smaller chunks. With DataParallel the gradients are accumulated on a single GPU and the optimization step happens there; with DistributedDataParallel each GPU runs in its own process via multiprocessing and optimization happens in parallel across GPUs.
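As a minimal, hypothetical illustration of the DataParallel route (the toy model and batch below are placeholders, not the network from the question):

import torch
import torch.nn as nn

# Toy model purely for illustration; assumes at least two GPUs are visible.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 30 * 30, 10),
)

# DataParallel replicates the model on the listed devices, splits each input
# batch along dim 0, and gathers the outputs back on device_ids[0].
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

x = torch.randn(64, 3, 32, 32).cuda()   # one large batch
y = model(x)                            # each GPU processes a 32-sample chunk

Note that DataParallel only parallelizes the module's forward pass over its inputs; a hand-written function like ac_distance that reads layer.weight directly is not split across devices, which is consistent with what you observed.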

In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try DistributedDataParallel, which according to the docs is the fastest way to train on multiple GPUs.
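A minimal DistributedDataParallel sketch, assuming a launch with torchrun (e.g. torchrun --nproc_per_node=2 script.py); the model and batch are again placeholders:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets LOCAL_RANK and the rendezvous variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Conv2d(3, 64, 3).cuda(local_rank)      # placeholder model
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(16, 3, 32, 32).cuda(local_rank)   # placeholder per-process batch
    loss = model(x).sum()
    loss.backward()                                   # gradients are all-reduced across processes

    dist.destroy_process_group()

if __name__ == "__main__":
    main()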

There are other ways to process very large batches too. This article goes through them in detail, and I'm sure it will be helpful. A few important points:

  • Do gradient accumulation for larger batches (see the sketch after this list)
  • Use DataParallel
  • If that doesn't suffice, go with DistributedDataParallel
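A minimal, self-contained gradient-accumulation sketch (the model, data, and accumulation factor below are placeholders for illustration):

import torch
import torch.nn as nn

# Toy setup purely for illustration.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accumulation_steps = 4                        # one optimizer step per 4 small batches

optimizer.zero_grad()
for step in range(16):
    inputs = torch.randn(8, 10)               # one small batch
    targets = torch.randint(0, 2, (8,))
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()                           # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # update with the effective large batch
        optimizer.zero_grad()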
