
Running out of GPU memory with PyTorch

I am running my own custom deep belief network (DBN) code using PyTorch with the LBFGS optimizer. After optimization starts, my GPU begins to run out of memory, running out completely after a couple of batches, but I'm not sure why. Should I be purging memory after each batch is run through the optimizer? (A sketch of what I mean by purging is shown after the code.) My code is as follows, with the portion that causes the problem marked:

def fine_tuning(self, data, labels, num_epochs=10, max_iter=3):
        '''
        Parameters
        ----------
        data : TYPE torch.Tensor
            N x D tensor with N = num samples, D = num dimensions
        labels : TYPE torch.Tensor
            N x 1 vector of labels for each sample
        num_epochs : TYPE, optional
            DESCRIPTION. The default is 10.
        max_iter : TYPE, optional
            DESCRIPTION. The default is 3.

        Returns
        -------
        None.

        '''
        N = data.shape[0]
        #need to unroll the weights into a typical autoencoder structure
        #encode - code - decode
        for ii in range(len(self.rbm_layers)-1, -1, -1):
            self.rbm_layers.append(self.rbm_layers[ii])
        
        L = len(self.rbm_layers)
        optimizer = torch.optim.LBFGS(params=list(itertools.chain(*[list(self.rbm_layers[ii].parameters()) 
                                                                    for ii in range(L)]
                                                                  )),
                                      max_iter=max_iter,
                                      line_search_fn='strong_wolfe') 
        
        dataset     = torch.utils.data.TensorDataset(data, labels)
        dataloader  = torch.utils.data.DataLoader(dataset, batch_size=self.batch_size*10, shuffle=True)
        #fine tune weights for num_epochs
        for epoch in range(1,num_epochs+1):
            with torch.no_grad():
                #get squared error before optimization
                v = self.pass_through_full(data)
                err = (1/N) * torch.sum(torch.pow(data-v.to("cpu"), 2))
            print("\nBefore epoch {}, train squared error: {:.4f}\n".format(epoch, err))
        
            #*******THIS IS THE PROBLEM SECTION*******#
            for ii,(batch,_) in tqdm(enumerate(dataloader), ascii=True, desc="DBN fine-tuning", file=sys.stdout):
                print("Fine-tuning epoch {}, batch {}".format(epoch, ii))
                with torch.no_grad():
                    batch = batch.view(len(batch) , self.rbm_layers[0].visible_units)
                    if self.use_gpu: #are we using a GPU?
                        batch = batch.to(self.device) #if so, send batch to GPU
                    B = batch.shape[0]
                    def closure():
                        optimizer.zero_grad()
                        output = self.pass_through_full(batch)
                        loss = nn.BCELoss(reduction='sum')(output, batch)/B
                        print("Batch {}, loss: {}\r".format(ii, loss))
                        loss.backward()
                        return loss
                    optimizer.step(closure)
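
For reference, this is roughly what I mean by purging after each batch (a self-contained toy version, not my actual DBN code):

import torch
import torch.nn as nn

# Toy version of the per-batch "purge" I'm asking about: drop references to
# the GPU tensors and ask PyTorch to release its cached, unused blocks.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(32, 32).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    batch = torch.rand(16, 32, device=device)
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(batch), batch)
    loss.backward()
    opt.step()
    del batch, loss              # drop references so the allocator can reuse them
    if device == "cuda":
        torch.cuda.empty_cache() # return cached blocks to the CUDA driver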

The error I get is:

DBN fine-tuning: 0it [00:00, ?it/s]Fine-tuning epoch 1, batch 0     
Batch 0, loss: 4021.35400390625  
Batch 0, loss: 4017.994873046875  
DBN fine-tuning: 0it [00:00, ?it/s] 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>   
  File "/home/deep_autoencoder/deep_autoencoder.py", line 260, in fine_tuning  
    optimizer.step(closure)  
  File "/home/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/optim/lbfgs.py", line 425, in step
    loss, flat_grad, t, ls_func_evals = _strong_wolfe(
  File "/home/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/optim/lbfgs.py", line 96, in _strong_wolfe
    g_prev = g_new.clone(memory_format=torch.contiguous_format)
RuntimeError: CUDA out of memory. Tried to allocate 1.57 GiB (GPU 0; 24.00 GiB total capacity; 13.24 GiB already allocated; 1.41 GiB free; 20.07 GiB reserved in total by PyTorch)

This also racks up memory if I use the CPU instead, so I'm not sure what the solution is here...

The official documentation on LBFGS says:

This is a very memory intensive optimizer (it requires additional param_bytes * (history_size + 1) bytes). If it doesn't fit in memory try reducing the history size, or use a different algorithm.

Since you didn't specify the history_size parameter when constructing torch.optim.LBFGS, it defaults to 100. Given that the first two batches already used more than 10 GB of memory, I'd guess you would need at least hundreds of GB at that setting.
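
As a rough back-of-the-envelope check of the formula from the docs (the layer sizes below are placeholders, not your actual unrolled DBN):

import torch.nn as nn

# Rough estimate of the extra memory LBFGS keeps around, using the formula
# from the docs: param_bytes * (history_size + 1).
# The layer sizes are placeholders -- substitute your unrolled DBN layers.
model = nn.Sequential(nn.Linear(4096, 1024), nn.Linear(1024, 4096))
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

for history_size in (100, 10, 1):
    extra_gib = param_bytes * (history_size + 1) / 1024**3
    print("history_size={:>3}: ~{:.2f} GiB extra".format(history_size, extra_gib))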

I'd suggest setting history_size to 1 to confirm that the problem is indeed caused by saving too much history. If it is, try solving it by reducing the history size or the number of parameters.
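
A minimal sketch of what that change looks like (the tiny linear model and random data are stand-ins for your unrolled DBN; the other optimizer arguments mirror your call):

import torch
import torch.nn as nn

# Stand-in model and data; in your case this is the unrolled DBN and a batch.
model = nn.Linear(64, 64)
data = torch.rand(128, 64)

optimizer = torch.optim.LBFGS(model.parameters(),
                              max_iter=3,
                              history_size=1,  # default is 100
                              line_search_fn='strong_wolfe')

def closure():
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(data), data)
    loss.backward()
    return loss

optimizer.step(closure)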
