Increase of GPU memory usage during training
I'm training a network on the classic MNIST dataset and ran into the following problem: once I start appending valid_metrics to loss_list and accuracy_list, the amount of GPU memory in use begins to grow every 1 or 2 epochs. Here is the code for the training loop:
def train_model(model: torch.nn.Module,
                train_dataset: torch.utils.data.Dataset,
                valid_dataset: torch.utils.data.Dataset,
                loss_function: torch.nn.Module = torch.nn.CrossEntropyLoss(),
                optimizer_class: Type[torch.optim.Optimizer] = torch.optim,
                optimizer_params: Dict = {},
                initial_lr = 0.01,
                lr_scheduler_class: Any = torch.optim.lr_scheduler.ReduceLROnPlateau,
                lr_scheduler_params: Dict = {},
                batch_size = 64,
                max_epochs = 1000,
                early_stopping_patience = 20):

    optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr, **optimizer_params)
    lr_scheduler = lr_scheduler_class(optimizer, **lr_scheduler_params)

    train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
    valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=batch_size)

    best_valid_loss = None
    best_epoch = None

    loss_list = list()
    accuracy_list = list()

    for epoch in range(max_epochs):
        print(f'Epoch {epoch}')
        start = timer()

        train_single_epoch(model, optimizer, loss_function, train_loader)
        valid_metrics = validate_single_epoch(model, loss_function, valid_loader)

        loss_list.append(valid_metrics['loss'])
        accuracy_list.append(valid_metrics['accuracy'])

        print('time:', timer() - start)
        print(f'Validation metrics: \n{valid_metrics}')

        lr_scheduler.step(valid_metrics['loss'])

        if best_valid_loss is None or best_valid_loss > valid_metrics['loss']:
            print(f'Best model yet, saving')
            best_valid_loss = valid_metrics['loss']
            best_epoch = epoch
            torch.save(model, './best_model.pth')

        if epoch - best_epoch > early_stopping_patience:
            print('Early stopping triggered')
            return loss_list, accuracy_list
and the code for validate_single_epoch:
def validate_single_epoch(model: torch.nn.Module,
                          loss_function: torch.nn.Module,
                          data_loader: torch.utils.data.DataLoader):

    loss_total = 0
    accuracy_total = 0

    for data in data_loader:
        X, y = data
        X, y = X.view(-1, 784), y.to(device)
        X = X.to(device)

        output = model(X)
        loss = loss_function(output, y)

        loss_total += loss

        y_pred = output.argmax(dim = 1, keepdim=True).to(device)
        accuracy_total += y_pred.eq(y.view_as(y_pred)).sum().item()

    loss_avg = loss_total / len(data_loader.dataset)
    accuracy_avg = 100.0 * accuracy_total / len(data_loader.dataset)

    return {'loss' : loss_avg, 'accuracy' : accuracy_avg}
The GPU I'm using is a GeForce MX250.
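For what it's worth, this is roughly how the growth can be observed between epochs. It is only a diagnostic sketch, not part of the training code above; it assumes a CUDA device is available and uses PyTorch's allocator statistics:

# Diagnostic sketch: print allocator statistics once per epoch to watch the growth.
# memory_allocated() / max_memory_allocated() report the bytes currently held /
# the peak bytes held by PyTorch's caching allocator on the current device.
import torch

def report_gpu_memory(epoch: int) -> None:
    allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
    peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f'Epoch {epoch}: allocated {allocated_mb:.1f} MiB, peak {peak_mb:.1f} MiB')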
The problem is most likely that gradients are being computed and stored during the validation loop. The simplest fix is probably to wrap the validation call in a no_grad context:
with torch.no_grad():
    valid_metrics = validate_single_epoch(model, loss_function, valid_loader)
If you prefer, you can also decorate validate_single_epoch(...) with @torch.no_grad():
@torch.no_grad()
def validate_single_epoch(...):
    # ...
Unrelated to your problem, but note that you are using the model in training mode during validation, which is probably not what you want. A call to model.eval() seems to be missing from the validation function; a possible reworking that combines both suggestions is sketched below.
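This is only a sketch of your validation function with those changes applied. It keeps the same logic, adds @torch.no_grad() and model.eval(), and additionally calls .item() on the per-batch loss so that a plain Python float is accumulated instead of a GPU tensor (it still relies on the device variable from your code):

# Sketch only: same logic as the original validate_single_epoch, with no_grad,
# eval mode, and .item() so that no tensors or autograd graphs are kept around.
@torch.no_grad()
def validate_single_epoch(model: torch.nn.Module,
                          loss_function: torch.nn.Module,
                          data_loader: torch.utils.data.DataLoader):
    model.eval()  # disable dropout / batch-norm updates during validation

    loss_total = 0.0
    accuracy_total = 0

    for X, y in data_loader:
        X, y = X.view(-1, 784).to(device), y.to(device)

        output = model(X)
        loss = loss_function(output, y)

        loss_total += loss.item()  # accumulate a float, not a tensor

        y_pred = output.argmax(dim=1, keepdim=True)
        accuracy_total += y_pred.eq(y.view_as(y_pred)).sum().item()

    model.train()  # restore training mode for the next epoch

    loss_avg = loss_total / len(data_loader.dataset)
    accuracy_avg = 100.0 * accuracy_total / len(data_loader.dataset)
    return {'loss': loss_avg, 'accuracy': accuracy_avg}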