Torch training in a multi-GPU environment

Question

I'm trying to run training in a multi-GPU environment.

Here's the model code:

net_1 = nn.Sequential(nn.Conv2d(2, 12, 5),
                nn.MaxPool2d(2),
                snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),
                nn.Conv2d(12, 32, 5),
                nn.MaxPool2d(2),
                snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),
                nn.Flatten(),
                nn.Linear(32*5*5, 10),
                snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True, output=True)
                )
net_1.cuda()
net = nn.DataParallel(net_1)

snn.Leaky is a module used to implement an SNN structure in combination with torch.nn, which makes the network behave like a kind of RNN. Documentation: https://snntorch.readthedocs.io/en/latest/readme.html
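To make the RNN-like behaviour concrete, here is a minimal pure-Python sketch of one leaky neuron carrying its membrane potential from step to step. It mirrors the expression visible in the traceback (`beta * mem + input_` minus `reset * threshold`), but it is an illustration only, not snnTorch's implementation; the names `leaky_step` and `run` and the default values are hypothetical.

```python
def leaky_step(input_current, mem, beta=0.9, threshold=1.0):
    """One time step of a leaky integrate-and-fire neuron (illustrative sketch)."""
    # Reset term: 1 if the membrane potential from the previous step exceeded the threshold.
    reset = 1.0 if mem > threshold else 0.0
    # Decay the old potential, add the new input, subtract the threshold on reset.
    mem = beta * mem + input_current - reset * threshold
    # Emit a spike when the updated potential exceeds the threshold.
    spk = 1.0 if mem > threshold else 0.0
    return spk, mem


def run(inputs, beta=0.9, threshold=1.0):
    """Feed a sequence of inputs through one neuron, keeping state between steps."""
    mem = 0.0  # hidden state, analogous to the state cleared by utils.reset(net)
    spikes = []
    for x in inputs:
        spk, mem = leaky_step(x, mem, beta, threshold)
        spikes.append(spk)
    return spikes, mem


spikes, final_mem = run([0.75, 0.75, 0.75, 0.75], beta=0.5)
print(spikes)  # [0.0, 1.0, 0.0, 0.0]
```

The point of the sketch is that `mem` is state living outside the weights. In a real module, that state is a tensor which has to sit on the same device as the input at every step.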

The input shape is (timestep, batchsize, 2, 32, 32).

Training code:

def forward_pass(net, data):
    spk_rec = []
    utils.reset(net)  # resets hidden states for all LIF neurons in net
    for step in range(data.size(1)):  # data.size(0) = number of time steps
        datas = data[:,step,:,:,:].cuda()
        net = net.to(device)
        spk_out, mem_out = net(datas)

        spk_rec.append(spk_out)

    return torch.stack(spk_rec)

optimizer = torch.optim.Adam(net.parameters(), lr=2e-2, betas=(0.9, 0.999))
loss_fn = SF.mse_count_loss(correct_rate=0.8, incorrect_rate=0.2)
num_epochs = 5
num_iters = 50

loss_hist = []
acc_hist = []
t_spk_rec_sum = []
start = time.time()

net.train()
# training loop
for epoch in range(num_epochs):
    for i, (data, targets) in enumerate(iter(trainloader)):
        data = data.to(device)
        targets = targets.to(device)


        spk_rec = forward_pass(net, data)
        loss_val = loss_fn(spk_rec, targets)

        # Gradient calculation + weight update
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
        # Store loss history for future plotting
        loss_hist.append(loss_val.item())
        print("time :", time.time() - start,"sec")
        print(f"Epoch {epoch}, Iteration {i} \nTrain Loss: {loss_val.item():.2f}")
        acc = SF.accuracy_rate(spk_rec, targets)
        acc_hist.append(acc)
        print(f"Train Accuracy: {acc * 100:.2f}%\n")

And I got this error:

Traceback (most recent call last):
  File "/home/hubo1024/PycharmProjects/snntorch/multi_gpu_train.py", line 87, in <module>
    spk_rec = forward_pass(net, data)
  File "/home/hubo1024/PycharmProjects/snntorch/multi_gpu_train.py", line 63, in forward_pass
    spk_out, mem_out = net(datas)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 162, in forward
    self.mem = self.state_fn(input_)
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 201, in _build_state_function_hidden
    self._base_state_function_hidden(input_) - self.reset * self.threshold
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 195, in _base_state_function_hidden
    base_fn = self.beta.clamp(0, 1) * self.mem + input_
  File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/_tensor.py", line 1121, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!


Process finished with exit code 1

Line 87 is

spk_rec = forward_pass(net, data)

from the training loop,

and line 63 is

    spk_out, mem_out = net(datas)

in the forward pass function.

I checked and made sure that no tensor is explicitly placed on the CPU, and the code works well when I run it on a single GPU.

I'm currently using

from torch.utils.data import DataLoader

to build the batched train loader. I suspect this might be the main source of the problem. Should I use a different data loader for multi-GPU training? If so, where can I find a reference for it? I searched a bit, but the information I found was somewhat dated.

Answer 1

This was a bug in the Leaky neuron that kept resetting its device when using DataParallel. It has been fixed in the current version of snnTorch on GitHub, and is addressed in this issue: https://github.com/jeshraghian/snntorch/issues/154

We're working on fixing up the other neurons now.

Solution 1
-1 2022-12-10 20:35:14
