简体   繁体   中英

torch training with Multi GPU enviroment

I'm trying to run a training on a multi gpu enviroment.

here's model code

net_1 = nn.Sequential(nn.Conv2d(2, 12, 5),
                nn.MaxPool2d(2),
                snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),
                nn.Conv2d(12, 32, 5),
                nn.MaxPool2d(2),
                snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),
                nn.Flatten(),
                nn.Linear(32*5*5, 10),
                snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True, output=True)
                )
net_1.cuda()
net = nn.DataParallel(net_1)

snn.Leaky is a module used to implement SNN structure combinig with torch.nn, Which makes.network work as kind of RNN. links here( https://snntorch.readthedocs.io/en/latest/readme.html )

The input shape looks like this (timestep, batchsize, 2, 32,32)

Training code

def forward_pass(net, data):
    spk_rec = []
    utils.reset(net)  # resets hidden states for all LIF neurons in net
    for step in range(data.size(1)):  # data.size(0) = number of time steps
        datas = data[:,step,:,:,:].cuda()
        net = net.to(device)
        spk_out, mem_out = net(datas)

        spk_rec.append(spk_out)

    return torch.stack(spk_rec)

optimizer = torch.optim.Adam(net.parameters(), lr=2e-2, betas=(0.9, 0.999))
loss_fn = SF.mse_count_loss(correct_rate=0.8, incorrect_rate=0.2)
num_epochs = 5
num_iters = 50

loss_hist = []
acc_hist = []
t_spk_rec_sum = []
start = time.time()

net.train()
# training loop
for epoch in range(num_epochs):
    for i, (data, targets) in enumerate(iter(trainloader)):
        data = data.to(device)
        targets = targets.to(device)


        spk_rec = forward_pass(net, data)
        loss_val = loss_fn(spk_rec, targets)

        # Gradient calculation + weight update
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
        # Store loss history for future plotting
        loss_hist.append(loss_val.item())
        print("time :", time.time() - start,"sec")
        print(f"Epoch {epoch}, Iteration {i} \nTrain Loss: {loss_val.item():.2f}")
        acc = SF.accuracy_rate(spk_rec, targets)
        acc_hist.append(acc)
        print(f"Train Accuracy: {acc * 100:.2f}%\n")

And I got this error

Traceback (most recent call last):
  File "/home/hubo1024/PycharmProjects/snntorch/multi_gpu_train.py", line 87, in <module>
    spk_rec = forward_pass(net, data)
  File "/home/hubo1024/PycharmProjects/snntorch/multi_gpu_train.py", line 63, in forward_pass
    spk_out, mem_out = net(datas)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 162, in forward
    self.mem = self.state_fn(input_)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 201, in _build_state_function_hidden
    self._base_state_function_hidden(input_) - self.reset * self.threshold
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 195, in _base_state_function_hidden
    base_fn = self.beta.clamp(0, 1) * self.mem + input_
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/_tensor.py", line 1121, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!


Process finished with exit code 1

Line 87 is

spk_rec = forward_pass(net, data)

from traning loop

and line 63 is

    spk_out, mem_out = net(datas)

of forward pass function

I checked and made sure that there's no part where the tensor is defined as cpu, And the code works well when I run this code in single GPU.

I'm currently using

torch.utils.data import DataLoader

for making batch train loader. I'm thinking that this might be main source of the problem. Should I use different dataloader for multi GPU training? And if so where can I find some reference with this?, I serched a bit but those info where a bit old.

This was a bug in the Leaky neuron that kept resetting its device when using DataParallel. It has been fixed in the current version of snnTorch in GitHub, and addressed in this issue: https://github.com/jeshraghian/snntorch/issues/154

We're working on fixing up the other neurons now.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM