
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! when using the Transformer architecture

I am running into a multi-GPU problem while practicing the Transformer with PyTorch. All of my previous PyTorch training worked simply by wrapping the model object in nn.DataParallel. That approach was fine up through seq2seq, but the Transformer returns the following error:

RuntimeError                              Traceback (most recent call last)
Cell In [44], line 66
     63 for epoch in range(N_EPOCHS):
     64     start_time = time.time() # record start time
---> 66     train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
     67     valid_loss = evaluate(model, validation_iterator, criterion)
     69     end_time = time.time() # record end time

Cell In [41], line 15, in train(model, iterator, optimizer, criterion, clip)
     11 optimizer.zero_grad()
     13 # exclude the last index (<eos>) of the output words
     14 # when feeding the input, start from <sos>
---> 15 output, _ = model(src, trg[:,:-1])
     17 # output: [batch size, trg_len - 1, output_dim]
     18 # trg: [batch size, trg_len]
     20 output_dim = output.shape[-1]

File ~/anaconda3/envs/jki_pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
...
    return forward_call(*input, **kwargs)
  File "/tmp/ipykernel_212252/284771533.py", line 31, in forward
    src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Currently, the device is set to cuda, and nn.DataParallel was applied only to the final Transformer model, not to the encoder and decoder objects.

# declare the encoder and decoder objects
enc = Encoder(INPUT_DIM, HIDDEN_DIM, ENC_LAYERS, ENC_HEADS, ENC_PF_DIM, ENC_DROPOUT, device)
dec = Decoder(OUTPUT_DIM, HIDDEN_DIM, DEC_LAYERS, DEC_HEADS, DEC_PF_DIM, DEC_DROPOUT, device)

# declare the Transformer object and apply parallel processing
model = nn.DataParallel(Transformer(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device))

I also tried applying nn.DataParallel to the encoder and decoder objects, but it still returns the same error. Has anyone had the same error? How did you solve it? I am using two 2080 Ti GPUs, and the device value is as follows.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

>>> cuda

Because of the current memory constraints, the batch size has to be very small, which inevitably hurts both training performance and training time. I look forward to your help.

This happens when (as the error says) two tensors involved in the same operation are on different GPUs.
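For illustration, here is a minimal toy snippet (my own, assuming a machine with at least two CUDA devices) that triggers the same error: any operation whose operands sit on different GPUs fails this way.

import torch

# minimal repro: needs a machine with at least two CUDA devices
a = torch.randn(4, 4, device="cuda:0")  # tensor on GPU 0
b = torch.randn(4, 4, device="cuda:1")  # tensor on GPU 1
c = a + b  # RuntimeError: Expected all tensors to be on the same device ...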

It's hard to know what the exact problem is without seeing the full code, but I'd recommend the following:

  1. Try to run your code on a single GPU. Just paste this at the beginning of your code (before any other imports):

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

  2. Make sure you use .cuda() and not .to(device) on all tensors and models, and that you don't send any tensor to a different device explicitly; DataParallel will handle the rest. :) (See the sketch after this list.)
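The traceback points at the line that adds self.pos_embedding(pos) inside the Encoder's forward. With nn.DataParallel, a frequent cause is that the module builds a helper tensor (such as the position indices pos) on a device stored at construction time, so it stays on cuda:0 while the inputs get scattered to cuda:1. Below is a minimal, self-contained sketch of that idea (the class and parameter names are illustrative, not your exact code): create pos on the device of the incoming batch.

import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    # Toy stand-in for the embedding step of a tutorial-style Encoder
    # (names and sizes are illustrative).
    def __init__(self, vocab_size, hidden_dim, max_len=100, dropout=0.1):
        super().__init__()
        self.tok_embedding = nn.Embedding(vocab_size, hidden_dim)
        self.pos_embedding = nn.Embedding(max_len, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = hidden_dim ** 0.5  # plain float, so it can never sit on the wrong GPU

    def forward(self, src):
        batch_size, src_len = src.shape
        # Build pos on the SAME device as the incoming batch (src.device) instead of
        # a device fixed in __init__; DataParallel scatters src across cuda:0 / cuda:1,
        # and pos now follows it automatically.
        pos = torch.arange(src_len, device=src.device).unsqueeze(0).repeat(batch_size, 1)
        return self.dropout(self.tok_embedding(src) * self.scale + self.pos_embedding(pos))

# usage sketch
if torch.cuda.device_count() > 1:
    emb = nn.DataParallel(TokenAndPositionEmbedding(vocab_size=1000, hidden_dim=256).cuda())
    src = torch.randint(0, 1000, (8, 20)).cuda()
    out = emb(src)  # no cross-device mismatch, because pos is created on src.device

The same idea applies to any other tensor the Encoder or Decoder creates internally (masks, scale factors, etc.): derive the device from the input rather than from a device saved in __init__.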
