[英]Continue training with torch.save and torch.load - key error messages
我是 Torch 的新手,正在使用 masked-cnn model 的代碼模板。為了在訓練中斷時做好准備,我在代碼中使用了 torch.save 和 torch.load,但我認為我不能單獨使用它繼續培訓課程? 我開始訓練:
model = train_mask_net(64)
這調用了 function train_mask.net,我在其中將 torch.save 包含在紀元循環中。 我想加載一個已保存的模型並在循環前繼續使用 torch.load 進行訓練,但我收到了優化器、損失和紀元調用的“關鍵錯誤”消息。 我是否應該像我在某些教程中看到的那樣創建一個特定的檢查點 function,或者我是否可以繼續使用 torch.saved 命令保存的文件進行訓練?
def train_mask_net(num_epochs=1):
data = MaskDataset(list(data_mask.keys()))
data_loader = torch.utils.data.DataLoader(data, batch_size=8, shuffle=True, num_workers=4)
model = XceptionHourglass(max_clz+2)
model.cuda()
dp = torch.nn.DataParallel(model)
loss = nn.CrossEntropyLoss()
params = [p for p in dp.parameters() if p.requires_grad]
optimizer = torch.optim.RMSprop(params, lr=2.5e-4, momentum=0.9)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
step_size=6,
gamma=0.9)
checkpoint = torch.load('imaterialist2020-pretrain-models/maskmodel_160.model_ep17')
#print(checkpoint)
model.load_state_dict(checkpoint)
#optimizer.load_state_dict(checkpoint)
#epoch = checkpoint['epoch']
#loss = checkpoint['loss']
for epoch in range(num_epochs):
print(epoch)
total_loss = []
prog = tqdm(data_loader, total=len(data_loader))
for i, (imag, mask) in enumerate(prog):
X = imag.cuda()
y = mask.cuda()
xx = dp(X)
# to 1D-array
y = y.reshape((y.size(0),-1)) # batch, flatten-img
y = y.reshape((y.size(0) * y.size(1),)) # flatten-all
xx = xx.reshape((xx.size(0), xx.size(1), -1)) # batch, channel, flatten-img
xx = torch.transpose(xx, 2, 1) # batch, flatten-img, channel
xx = xx.reshape((xx.size(0) * xx.size(1),-1)) # flatten-all, channel
losses = loss(xx, y)
prog.set_description("loss:%05f"%losses)
optimizer.zero_grad()
losses.backward()
optimizer.step()
total_loss.append(losses.detach().cpu().numpy())
torch.save(model.state_dict(), MODEL_FILE_DIR+"maskmodel_%d.model"%attr_image_size[0]+'_ep'+str(epoch)+'_tsave')
prog, X, xx, y, losses = None, None, None, None, None,
torch.cuda.empty_cache()
gc.collect()
return model
我認為沒有必要,但 xceptionhour class 看起來像這樣:
class XceptionHourglass(nn.Module):
def __init__(self, num_classes):
super(XceptionHourglass, self).__init__()
self.num_classes = num_classes
self.conv1 = nn.Conv2d(3, 128, 3, 2, 1, bias=True)
self.bn1 = nn.BatchNorm2d(128)
self.mish = Mish()
self.conv2 = nn.Conv2d(128, 256, 3, 1, 1, bias=True)
self.bn2 = nn.BatchNorm2d(256)
self.block1 = HourglassNet(4, 256)
self.bn3 = nn.BatchNorm2d(256)
self.block2 = HourglassNet(4, 256)
...
torch.save(model.state_dict(), PATH)
只保存 model 個權重。
要同時保存優化器、損失、epoch 等,將其更改為:
torch.save({'model': model.state_dict(),
'optimizer': optimizer.state_dict(),
'loss': loss,
'epoch': epoch,
# ...
}, PATH)
加載它們:
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
更多關於它在這里。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.