I am new to Torch and am using a code template for a masked-CNN model. To be prepared in case training is interrupted, I have used torch.save and torch.load in my code, but I suspect these alone are not enough to continue a training session. I start training with:
model = train_mask_net(64)
This calls the function train_mask_net, where I have included torch.save in the epoch loop. I wanted to load one of the saved models and continue training by calling torch.load before the loop, but I got "key error" messages for the optimizer, loss and epoch lookups. Should I have written a dedicated checkpoint function, as I have seen in some tutorials, or can I continue training from the files saved by torch.save?
def train_mask_net(num_epochs=1):
    data = MaskDataset(list(data_mask.keys()))
    data_loader = torch.utils.data.DataLoader(data, batch_size=8, shuffle=True, num_workers=4)
    model = XceptionHourglass(max_clz+2)
    model.cuda()
    dp = torch.nn.DataParallel(model)
    loss = nn.CrossEntropyLoss()
    params = [p for p in dp.parameters() if p.requires_grad]
    optimizer = torch.optim.RMSprop(params, lr=2.5e-4, momentum=0.9)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=6,
                                                   gamma=0.9)
    checkpoint = torch.load('imaterialist2020-pretrain-models/maskmodel_160.model_ep17')
    #print(checkpoint)
    model.load_state_dict(checkpoint)
    #optimizer.load_state_dict(checkpoint)
    #epoch = checkpoint['epoch']
    #loss = checkpoint['loss']
    for epoch in range(num_epochs):
        print(epoch)
        total_loss = []
        prog = tqdm(data_loader, total=len(data_loader))
        for i, (imag, mask) in enumerate(prog):
            X = imag.cuda()
            y = mask.cuda()
            xx = dp(X)
            # to 1D-array
            y = y.reshape((y.size(0),-1)) # batch, flatten-img
            y = y.reshape((y.size(0) * y.size(1),)) # flatten-all
            xx = xx.reshape((xx.size(0), xx.size(1), -1)) # batch, channel, flatten-img
            xx = torch.transpose(xx, 2, 1) # batch, flatten-img, channel
            xx = xx.reshape((xx.size(0) * xx.size(1),-1)) # flatten-all, channel
            losses = loss(xx, y)
            prog.set_description("loss:%05f"%losses)
            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
            total_loss.append(losses.detach().cpu().numpy())
        torch.save(model.state_dict(), MODEL_FILE_DIR+"maskmodel_%d.model"%attr_image_size[0]+'_ep'+str(epoch)+'_tsave')
        prog, X, xx, y, losses = None, None, None, None, None,
        torch.cuda.empty_cache()
        gc.collect()
    return model
I don't think it's necessary, but the XceptionHourglass class looks like this:
class XceptionHourglass(nn.Module):
    def __init__(self, num_classes):
        super(XceptionHourglass, self).__init__()
        self.num_classes = num_classes
        self.conv1 = nn.Conv2d(3, 128, 3, 2, 1, bias=True)
        self.bn1 = nn.BatchNorm2d(128)
        self.mish = Mish()
        self.conv2 = nn.Conv2d(128, 256, 3, 1, 1, bias=True)
        self.bn2 = nn.BatchNorm2d(256)
        self.block1 = HourglassNet(4, 256)
        self.bn3 = nn.BatchNorm2d(256)
        self.block2 = HourglassNet(4, 256)
        ...
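torch.save(model.state_dict(), PATH)
only saves the model weights, so the loaded object is a plain state dict with no 'optimizer', 'epoch' or 'loss' entries. That is where the key errors come from.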
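To see what a weights-only file contains, here is a minimal sketch. It assumes a tiny nn.Linear as a stand-in for the real model and an in-memory buffer instead of a file path, and it runs on the CPU; neither choice comes from the question.

```python
import io

import torch
import torch.nn as nn

# Tiny stand-in model (an assumption; the real model is XceptionHourglass).
model = nn.Linear(4, 2)

# Save only the weights, as in the question. The buffer stands in for a file.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# What comes back is a mapping from parameter names to tensors, nothing else.
checkpoint = torch.load(buffer)
keys = list(checkpoint.keys())
print(keys)

# There is no 'epoch' key, so checkpoint['epoch'] would raise a KeyError.
has_epoch = 'epoch' in checkpoint
print(has_epoch)
```

Such a file is still fine for restoring the weights with model.load_state_dict(checkpoint), as the question's code already does, but the optimizer state and epoch counter were never written to it.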
To also save optimizer, loss, epoch, etc., change it to:
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'loss': loss,
            'epoch': epoch,
            # ...
            }, PATH)
To load them:
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
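Putting the save and load steps together, resuming could look roughly like the sketch below. Several things here are assumptions rather than part of the original code: the helper names save_checkpoint, load_checkpoint and build are hypothetical, a tiny nn.Linear on the CPU replaces the real model, an in-memory buffer replaces the file path, and a 'scheduler' entry is added because the question also uses a StepLR scheduler. The sketch stores the scalar loss value rather than the nn.CrossEntropyLoss object, since the criterion holds no training state.

```python
import io

import torch
import torch.nn as nn


def save_checkpoint(target, model, optimizer, scheduler, epoch, loss_value):
    """Write everything needed to resume after `epoch` has finished."""
    torch.save({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'epoch': epoch,
        'loss': loss_value,  # a plain float, not the criterion module
    }, target)


def load_checkpoint(source, model, optimizer, scheduler):
    """Restore state in place; return the next epoch and the last loss."""
    checkpoint = torch.load(source, map_location='cpu')
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    scheduler.load_state_dict(checkpoint['scheduler'])
    return checkpoint['epoch'] + 1, checkpoint['loss']


def build():
    """Create model, optimizer and scheduler the same way in every session."""
    model = nn.Linear(4, 2)  # stand-in for XceptionHourglass
    optimizer = torch.optim.RMSprop(model.parameters(), lr=2.5e-4, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.9)
    return model, optimizer, scheduler


criterion = nn.CrossEntropyLoss()
X = torch.randn(8, 4)            # dummy inputs
y = torch.randint(0, 2, (8,))    # dummy class labels

# First session: train epochs 0..2 and checkpoint at the end of each epoch.
model, optimizer, scheduler = build()
for epoch in range(3):
    losses = criterion(model(X), y)
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
    scheduler.step()
    buffer = io.BytesIO()  # stands in for a per-epoch file path
    save_checkpoint(buffer, model, optimizer, scheduler, epoch, losses.item())

# Second session: rebuild the objects, restore them, continue where we stopped.
model2, optimizer2, scheduler2 = build()
buffer.seek(0)
start_epoch, last_loss = load_checkpoint(buffer, model2, optimizer2, scheduler2)
for epoch in range(start_epoch, 5):
    losses = criterion(model2(X), y)
    optimizer2.zero_grad()
    losses.backward()
    optimizer2.step()
    scheduler2.step()
```

The key design point is that the epoch loop starts at start_epoch instead of 0, so the epoch numbers in file names and the learning-rate schedule carry on from the interrupted run.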
The PyTorch documentation on saving and loading models covers this in more detail.