I ran into some problems when using PyTorch DistributedDataParallel. The situation is:
My model is A, and it has been trained on a single GPU as usual. Suppose there are three layers in A:
class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.layer0 = layer0
        self.layer1 = layer1
        self.layer2 = layer2

    def forward(self, x):
        x = self.layer0(x)
        x = self.layer1(x)
        x = self.layer2(x)
        return x
Now I have some new data. I want to fine-tune A with it on multiple GPUs. I need to wrap A as a multi-GPU model B .
But there are two training stages. In the 1st stage, I want to freeze layer0 and layer1 of B. In the 2nd stage, I only want to freeze layer0. So the requires_grad of the parameters in layer1 would have to change during training. However, the DistributedDataParallel doc says:
You should never try to change your model's parameters after wrapping up your model with DistributedDataParallel.
In fact, I tried using B.module to refer to the A wrapped inside B, but the test results were abnormal compared to the single-GPU model. Maybe this approach is disallowed.
What should I do? Is there any proper way to wrap my model? And what should I take care of when saving and loading the model?
I only run it on a single machine with multiple GPUs, so the distributed case with multiple machines can be ignored. Many thanks.
Update 2019.12.03
As suggested by @jodag, I tried DataParallel, but it didn't work. This time I didn't change anything in B (except training it) after wrapping it. For simplicity, my code is like this (and I referred to this):
import torch.nn as nn
import torch.optim as optim

class B(nn.DataParallel):
    # Forward attribute lookups that DataParallel itself does not define
    # to the wrapped model.
    def __getattr__(self, name):
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.module, name)

a = A()
b = B(a, device_ids=[0, 1])
b = b.cuda()

trained_param = b.layer2.parameters()
# trained_param = [{'params': b.layer2.parameters()}, {'params': b.layer1.parameters()}]
optimizer = optim.Adam(trained_param)

b.train()
...
for x, label in data_loader:
    optimizer.zero_grad()
    x = x.to(0)  # This line can be commented out.
    y = b(x)
    l = loss(y, label)
    l.backward()
    optimizer.step()
If you only want to optimize part of the parameters, why not control this via the optimizer rather than the model?
You can leave your model as-is (wrapped in DistributedDataParallel) and pass only the relevant subset of its parameters to the optimizer.
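A minimal sketch of that idea, assuming the model A from the question, that torch.distributed.init_process_group(...) has already been called in each process, and that local_rank is a hypothetical variable holding this process's GPU index:

import torch.nn as nn
import torch.optim as optim

a = A().to(local_rank)  # local_rank is assumed to be set up by your launcher
b = nn.parallel.DistributedDataParallel(a, device_ids=[local_rank])

# Stage 1: only layer2 is updated; layer0 and layer1 stay effectively frozen
# because their parameters are never handed to the optimizer.
optimizer = optim.Adam(b.module.layer2.parameters(), lr=1e-4)

# ... train stage 1 ...

# Stage 2: additionally update layer1 by building a new optimizer with both
# parameter groups. The wrapped model itself is never modified.
optimizer = optim.Adam([
    {'params': b.module.layer2.parameters()},
    {'params': b.module.layer1.parameters()},
], lr=1e-4)

# ... train stage 2 ...

Note that parameters left out of the optimizer still have their gradients computed and synchronized; they are simply never updated, so nothing on the wrapped model (such as requires_grad) needs to change between stages.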