Train a single pytorch model on multiple GPUs with some layers fixed?
I met some problems when using PyTorch DistributedDataParallel. The situation is:
My model is A, and it has been trained on a single GPU as usual. Suppose that there are three layers in A:

class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.layer0 = layer0
        self.layer1 = layer1
        self.layer2 = layer2

    def forward(self, x):
        x = self.layer0(x)
        x = self.layer1(x)
        x = self.layer2(x)
        return x
Now I have some new data. I want to fine-tune A with it on multiple GPUs. I need to wrap A as a multi-GPU model B.
But there are two training stages. In the 1st stage, I want to fix (freeze) layer0 and layer1 of B. In the 2nd stage, I want to fix only layer0. So requires_grad of the parameters in layer1 should be changed during training. However, the DistributedDataParallel doc says:
You should never try to change your model's parameters after wrapping up your model with DistributedDataParallel.
In fact, I tried to use B.module to refer to the A wrapped inside B. But the test results were abnormal compared to the single-GPU model. Maybe this way is disallowed.

What should I do? Is there any proper way to wrap my model? And what should I take care of when saving and loading the model?

It just runs on a single machine with multiple GPUs, so you can ignore the distributed situation with multiple machines. Many thanks.
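On the saving and loading part, a common pattern (a hedged sketch with a placeholder layer, not code from the post) is to save the state dict of the underlying module rather than of the wrapper, so the checkpoint's keys carry no "module." prefix and can be loaded back into a plain single-GPU model:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for the model A from the question (the layer is a placeholder).
class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.layer0 = nn.Linear(4, 4)

a = A()
b = nn.DataParallel(a)  # the same idea applies to DistributedDataParallel

# Save the inner module's state dict: its keys have no 'module.' prefix,
# so a plain single-GPU A can load the checkpoint directly.
path = os.path.join(tempfile.gettempdir(), 'a_checkpoint.pt')
torch.save(b.module.state_dict(), path)

a2 = A()
a2.load_state_dict(torch.load(path))
```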
Update 2019.12.03
As suggested by @jodag, I tried DataParallel, but it didn't work. This time I didn't change anything in B (except training it) after wrapping it. For simplification, my code is like this (and I referred to this):
class B(nn.DataParallel):
    def __getattr__(self, name):
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.module, name)

a = A()
b = B(a, device_ids=[0, 1])
b = b.cuda()
trained_param = b.layer2.parameters()
# trained_param = [{'params': b.layer2.parameters()}, {'params': b.layer1.parameters()}]
optimizer = optim.Adam(trained_param)
b.train()
...
for x, label in data_loader:
    optimizer.zero_grad()
    x = x.to(0)  # This line can be commented out.
    y = b(x)
    l = loss(y, label)
    l.backward()
    optimizer.step()
If you only try to optimize part of the parameters, why not try controlling this via the optimizer, rather than the model?
You can leave your model as-is (wrapped in a DistributedDataParallel) and pass only part of its parameters to the relevant optimizer.
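Following that suggestion, here is a minimal CPU sketch (assuming the three-layer A from the question, with placeholder nn.Linear layers): each stage gets its own optimizer, and "freezing" a layer just means leaving its parameters out of that stage's optimizer. Gradients are still computed for the left-out layers, which wastes a little compute, but the wrapped model is never mutated.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for the three-layer model A from the question
# (the Linear sizes are placeholders).
class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.layer0 = nn.Linear(4, 4)
        self.layer1 = nn.Linear(4, 4)
        self.layer2 = nn.Linear(4, 4)

    def forward(self, x):
        return self.layer2(self.layer1(self.layer0(x)))

model = A()  # in real use this would be wrapped in DistributedDataParallel

# Stage 1: only layer2 is updated.
opt_stage1 = optim.Adam(model.layer2.parameters(), lr=1e-3)
# Stage 2: layer1 and layer2 are updated.
opt_stage2 = optim.Adam(
    [{'params': model.layer1.parameters()},
     {'params': model.layer2.parameters()}], lr=1e-3)

x = torch.randn(8, 4)
frozen_before = model.layer1.weight.detach().clone()
trained_before = model.layer2.bias.detach().clone()

# One stage-1 step: layer1 still receives a gradient, but it is never
# applied, because opt_stage1 does not know about layer1's parameters.
opt_stage1.zero_grad()
model(x).sum().backward()
opt_stage1.step()

assert torch.equal(model.layer1.weight, frozen_before)     # layer1 untouched
assert not torch.equal(model.layer2.bias, trained_before)  # layer2 updated
```

Switching stages is then just a matter of stepping opt_stage2 instead of opt_stage1 in the training loop; nothing on the wrapped model changes.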