How do I deal with GPU memory leaks in Torch?

Question

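My machine's GPU has 2 GB of memory. The first time I run the code below, I get no errors. The second time I run it, I get a memory error. As a short-term remedy, the only thing I can do is cast the data to float32 with torch.Tensor.float(). The problem persists, though: the occupied memory is not released after the process finishes, or when the process is terminated mid-run. The same happens with the machine's RAM. How can I prevent memory leaks, or release memory, in Torch?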

require 'nn'
require 'image'
require 'cunn'
require 'paths'



collectgarbage(); collectgarbage()
if (not paths.filep("cifar10torchsmall.zip")) then
    os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip')
    os.execute('unzip cifar10torchsmall.zip')
end
trainset = torch.load('cifar10-train.t7')
testset = torch.load('cifar10-test.t7')
classes = {'airplane', 'automobile', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck'}

setmetatable(trainset, 
    {__index = function(t, i) 
                    return {t.data[i], t.label[i]} 
                end}
);
trainset.data = trainset.data:double() -- convert the data from a ByteTensor to a DoubleTensor.

function trainset:size() 
    return self.data:size(1) 
end

mean = {} -- store the mean, to normalize the test set in the future
stdv  = {} -- store the standard-deviation for the future
for i=1,3 do -- over each image channel
    mean[i] = trainset.data[{ {}, {i}, {}, {}  }]:mean() -- mean estimation
    print('Channel ' .. i .. ', Mean: ' .. mean[i])
    trainset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction

    stdv[i] = trainset.data[{ {}, {i}, {}, {}  }]:std() -- std estimation
    print('Channel ' .. i .. ', Standard Deviation: ' .. stdv[i])
    trainset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end


testset.data = testset.data:double()   -- convert from Byte tensor to Double tensor
for i=1,3 do -- over each image channel
    testset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction    
    testset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end

trainset.data = trainset.data:cuda()
testset.data = testset.data:cuda()

net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5)) -- 3 input image channels, 6 output channels, 5x5 convolution kernel
net:add(nn.ReLU())                       -- non-linearity 
net:add(nn.SpatialMaxPooling(2,2,2,2))     -- A max-pooling operation that looks at 2x2 windows and finds the max.
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.ReLU())                       -- non-linearity 
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5))                    -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120))             -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.ReLU())                       -- non-linearity 
net:add(nn.Linear(120, 84))
net:add(nn.ReLU())                       -- non-linearity 
net:add(nn.Linear(84, 10))                   -- 10 is the number of outputs of the network (in this case, the 10 CIFAR-10 classes)
net:add(nn.LogSoftMax())  
net = net:cuda()

criterion = nn.ClassNLLCriterion()
criterion = criterion:cuda()



pred = net:forward(trainset.data)
outputEr = criterion:forward(pred, trainset.label:cuda())
net:zeroGradParameters()
outputGrad = criterion:backward(pred, trainset.label:cuda())
collectgarbage()
inputGrad = net:backward(trainset.data, outputGrad)
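
A note on the last few lines above: they push the entire training set through the network in a single call, so every intermediate buffer is sized for the whole dataset. Whether that is what exhausts the 2 GB card here is an assumption, not something the post confirms. A minimal sketch of the usual alternative, keeping the data on the CPU and moving one mini-batch at a time to the GPU, might look like the following. The synthetic data, the shortened network, and `batchSize` are all hypothetical stand-ins.

```lua
require 'cunn'

-- Synthetic stand-ins for trainset.data / trainset.label, kept on the CPU.
local data = torch.FloatTensor(1000, 3, 32, 32):uniform()
local labels = torch.FloatTensor(1000):random(1, 10)

-- Shortened version of the network above (one conv block only).
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5))   -- 32x32 -> 28x28
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2, 2, 2, 2))    -- 28x28 -> 14x14
net:add(nn.View(6 * 14 * 14))
net:add(nn.Linear(6 * 14 * 14, 10))
net:add(nn.LogSoftMax())
net = net:cuda()
local criterion = nn.ClassNLLCriterion():cuda()

local batchSize = 128  -- hypothetical value; tune to the available GPU memory
local n = data:size(1)

for first = 1, n, batchSize do
    local size = math.min(batchSize, n - first + 1)
    -- narrow() returns a view; :cuda() copies only this slice to the GPU.
    local inputs = data:narrow(1, first, size):cuda()
    local targets = labels:narrow(1, first, size):cuda()

    net:zeroGradParameters()
    local pred = net:forward(inputs)
    local loss = criterion:forward(pred, targets)
    net:backward(inputs, criterion:backward(pred, targets))
    -- A parameter update would go here, e.g. net:updateParameters(0.01).
end
collectgarbage()
```

With this layout the GPU only ever holds one batch of inputs and activations, and the per-batch tensors are locals that become collectable at the end of each iteration.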

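Side question: why does Torch initialize network parameters as double, even though GPUs are slow at double-precision arithmetic and almost no neural network application actually needs 64-bit parameter values? How can I initialize a model with float (32-bit) parameter vectors?

I found the answer to this one. You can set Torch's default tensor type to float by putting the following at the start of your code: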

torch.setdefaulttensortype('torch.FloatTensor')
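
To check that the setting took effect, one option is to print the default type and the type of a freshly created layer's weights. This is a small sketch rather than part of the original post; the layer sizes are arbitrary, and `net:float()` is shown as the way to convert a model that was already built with double parameters.

```lua
require 'torch'
require 'nn'

-- Make newly created tensors (and therefore new module parameters) 32-bit floats.
torch.setdefaulttensortype('torch.FloatTensor')

local layer = nn.Linear(84, 10)
print(torch.getdefaulttensortype())  -- name of the current default tensor type
print(layer.weight:type())           -- type of the freshly created weights

-- A model that already exists can be converted in place instead.
local net = nn.Sequential():add(nn.Linear(120, 84))
net:float()
print(net:get(1).weight:type())
```

Note that calling `:cuda()` on a model or tensor converts it to `torch.CudaTensor`, which is single precision, so the default type mainly affects what lives on the CPU.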

Answer 1

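I was able to (almost) resolve the issue by upgrading from CUDA 6.5 to CUDA 7.5 on the machine where I was running the experiments above. Now, most of the time, GPU memory is released when the program crashes mid-run. Occasionally it still is not, and I have to reboot the machine.

In addition, I do the following to make sure the GPU memory is cleared once the program has run successfully: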
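
To see whether memory is actually being released, it can help to print the free GPU memory at a few points in the script. The sketch below uses `cutorch.getMemoryUsage`, which returns free and total bytes for a device; `reportGpuMemory` is a hypothetical helper name, and the tensor size is arbitrary.

```lua
require 'cutorch'

-- Hypothetical helper: print free/total memory of the current GPU in MB.
local function reportGpuMemory(tag)
    local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
    print(string.format('%s: %.0f MB free of %.0f MB',
                        tag, free / 2^20, total / 2^20))
end

reportGpuMemory('before allocation')
local x = torch.CudaTensor(1000, 1000):zero()
reportGpuMemory('after allocation')

-- Drop the only reference, then run the collector so the tensor is freed.
x = nil
collectgarbage(); collectgarbage()
reportGpuMemory('after release')
```

Running `nvidia-smi` from a shell before and after the script gives the same information from outside the process, which is useful when the process has already crashed.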

net = nil
trainset = nil
testset = nil
pred = nil
inputGrad = nil
criterion = nil

collectgarbage()
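
The explicit `nil` assignments are needed because the script stores everything in global variables, and Lua globals stay reachable until they are reassigned. A possible variation, sketched here under the assumption that the work can be wrapped in a function, is to use locals and `pcall` so that the references disappear and the collector runs whether or not the run succeeds. The tiny network and the name `run` are illustrative only.

```lua
require 'cunn'

-- All GPU objects are locals inside run(), so they become unreachable
-- as soon as the function returns or raises an error.
local function run()
    local net = nn.Sequential():add(nn.Linear(10, 2)):cuda()
    local input = torch.CudaTensor(4, 10):uniform()
    local out = net:forward(input)
    return out:float()  -- copy the result to the CPU before returning
end

local ok, result = pcall(run)

-- Runs on success and on failure alike.
collectgarbage(); collectgarbage()

if ok then
    print(result)
else
    print('run failed: ' .. tostring(result))
end
```

This does not help when the process is killed from outside, which is the case the answer says still occasionally requires a reboot.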

Answer 1 · 2 · accepted · 2016-04-08 00:15:44