Using Multiple GPUs outside of training in PyTorch
I'm calculating the accumulated distance between each pair of kernels inside a nn.Conv2d layer. However, for large layers it runs out of memory using a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such calculations across two GPUs. The code follows:
def ac_distance(layer):
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total
Where layer is an instance of nn.Conv2d, and distance returns the sum of the differences between p and q. I can't detach the graph, however, for I need it later on. I tried wrapping my model in nn.DataParallel, but all calculations in ac_distance are done using only 1 GPU; it trains using both, however.
Parallelism while training neural networks can be achieved in two ways: data parallelism (the same model runs on each GPU, each on a different chunk of the batch) and model parallelism (the computation itself is split across GPUs). As you have asked in the question, you would like to split the calculation, which falls into the second category. There are no out-of-the-box ways to achieve model parallelism.
没有开箱即用的方式来实现模型并行性。 PyTorch provides primitives for parallel processing using the
torch.distributed
package. PyTorch提供了使用
torch.distributed
包进行并行处理的原语。 This tutorial comprehensively goes through the details of the package and you can cook up an approach to achieve model parallelism that you need. 本教程全面介绍了该软件包的详细信息,您可以制定一种方法来实现所需的模型并行性。
However, model parallelism can be very complex to achieve. The general way is to do data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods you run the same model on two different GPUs, but one huge batch is split into two smaller chunks. The gradients are accumulated on a single GPU and optimization happens. Optimization takes place on a single GPU in DataParallel, and in parallel across GPUs in DistributedDataParallel, which uses multiprocessing.
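To make the DataParallel path concrete, here is a minimal sketch of wrapping a module so that each forward pass splits the batch along dimension 0 across the available GPUs. The guard means the same code simply runs on one device on a single-GPU or CPU-only machine; the batch and layer sizes are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3)
if torch.cuda.device_count() > 1:
    # Replicates the module on each GPU and scatters the batch along dim 0;
    # outputs are gathered back onto the default device.
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

x = torch.randn(16, 3, 32, 32)
if torch.cuda.is_available():
    x = x.cuda()
out = model(x)
print(out.shape)  # torch.Size([16, 8, 30, 30])
```

With two GPUs, each replica sees a chunk of 8 samples; the gather step onto one device is why GPU 0 typically shows higher memory usage, which is the imbalance discussed below.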
In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try DistributedDataParallel, which is the fastest way to train on multiple GPUs according to the docs.
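A minimal sketch of the DistributedDataParallel setup, under the assumption that in practice you launch one process per GPU (e.g. with torchrun) and use the "nccl" backend; here a single CPU process with the "gloo" backend keeps the sketch runnable anywhere. The rank/world-size values and tensor shapes are illustrative.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank=0, world_size=1):
    # With real GPUs: one process per device, backend="nccl".
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    model = nn.Conv2d(3, 8, kernel_size=3).to(device)
    ddp_model = DDP(model)  # gradients are all-reduced across processes on backward()
    out = ddp_model(torch.randn(4, 3, 16, 16, device=device))
    out.sum().backward()
    dist.destroy_process_group()
    return out.shape

shape = run()
print(shape)
```

Unlike DataParallel, each process owns its own optimizer step, so the gradient all-reduce happens in parallel rather than being gathered onto one GPU.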
There are other ways to process very large batches too. This article goes through them in detail, and I'm sure it would be helpful. A few important points: