PyTorch - Caught StopIteration in replica 1 on device 1 error while training on GPU
I am trying to train a BertPunc model on the train2012 data used in this repository: https://github.com/nkrnrnk/BertPunc . While running on a server with 4 GPUs enabled, below is the error I get:
StopIteration: Caught StopIteration in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/notebooks/model_BertPunc.py", line 16, in forward
x = self.bert(x)
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/anaconda3/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py", line 861, in forward
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask,
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/anaconda3/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py", line 727, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
From https://github.com/huggingface/transformers/issues/8145 , this appears to happen when the model is replicated back and forth across multiple GPUs. As per https://github.com/interpretml/interpret-text/issues/117 , the suggested fix is to downgrade PyTorch from 1.7 to 1.4. For me, downgrading isn't an option, as I have other scripts that require Torch 1.7. What should I do to overcome this error?
I can't post the whole code here as there are too many lines, but here is the snippet that raises the error:
bert_punc, optimizer, best_val_loss = train(bert_punc, optimizer, criterion, epochs_top,
data_loader_train, data_loader_valid, save_path, punctuation_enc, iterations_top, best_val_loss=1e9)
Here is my DataParallel code:
bert_punc = nn.DataParallel(BertPunc(segment_size, output_size, dropout)).cuda()
I tried changing the DataParallel line to divert training to only 1 of the 4 GPUs, but that gave me an out-of-memory issue, so I had to revert the code to the default. Here is the link to all the scripts that I am using: https://github.com/nkrnrnk/BertPunc . Please advise.
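For reference, one way to attempt the single-GPU workaround described above is to limit which devices PyTorch can see before importing torch. This is a hedged sketch, not code from the repository; the commented DataParallel call reuses the names from the question's snippet. Note that with one visible device the whole batch lands on a single GPU, which is exactly the memory ("space") issue mentioned above.

```python
import os

# Restrict the GPUs PyTorch can see. This must happen *before* importing
# torch, otherwise the CUDA context has already enumerated all devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Torch-side equivalent (names taken from the question's snippet):
# import torch.nn as nn  # after the env var is set
# bert_punc = nn.DataParallel(BertPunc(segment_size, output_size, dropout),
#                             device_ids=[0]).cuda()
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints: 0
```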
Change
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
to
extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) # fp16 compatibility
For more details, see https://github.com/vid-koci/bert-commonsense/issues/6
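To see why this edit works, here is a minimal plain-Python sketch of the failure mode (no GPU or torch install needed). Under nn.DataParallel, a replica's parameters() can yield an empty iterator, so next(...) raises StopIteration, which parallel_apply surfaces as "Caught StopIteration in replica N on device N". The hypothetical parameters function below stands in for the replica's empty generator:

```python
def parameters():
    # stands in for a DataParallel replica whose parameters() is empty
    return iter([])

try:
    # the failing pattern from pytorch_pretrained_bert/modeling.py line 727
    dtype = next(parameters()).dtype
except StopIteration:
    # the accepted fix sidesteps the lookup by hardcoding the dtype
    dtype = "torch.float32"

print(dtype)  # prints: torch.float32
```

Hardcoding torch.float32 is safe here as long as the model is not actually run in fp16, since the dtype lookup existed only for fp16 compatibility.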
I second Xiaoou Wang's answer. Just adding the path of the file that needed updating in my environment, for clarity:
"/data/home/cohnstav/anaconda3/envs/BestEnv/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py" “/data/home/cohnstav/anaconda3/envs/BestEnv/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py”