Fast-Bert with CamemBERT: "arguments are located on different GPUs" error
I'm trying to use Fast-Bert with the 'camembert-base' model type. I can create my databunch with BertLMDataBunch.from_raw_corpus without trouble, and then I create the learner. I'm on a cloud Ubuntu instance with 3 GPUs, 32 cores, and 130 MB of RAM. Whenever I try to fit the model, I get the error message below, right after this output:
import logging
import torch
from fast_bert.data_lm import BertLMDataBunch
from fast_bert.learner_lm import BertLMLearner

# DATA_PATH, MODEL_PATH and all_texts are defined earlier in the notebook
logger = logging.getLogger()
device_cuda = torch.device("cuda")

# Build the LM databunch from the raw text corpus
databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=all_texts,
    tokenizer='camembert-base',
    batch_size_per_gpu=16,
    max_seq_length=512,
    multi_gpu=True,
    model_type='camembert-base',
    logger=logger)

# Create the LM learner from the pretrained CamemBERT weights
lm_learner = BertLMLearner.from_pretrained_model(
    dataBunch=databunch_lm,
    pretrained_path='camembert-base',
    output_dir=MODEL_PATH,
    metrics=[],
    device=device_cuda,
    logger=logger,
    multi_gpu=True,
    logging_steps=50,
    fp16_opt_level="O2")

# Fine-tune the language model
lm_learner.fit(epochs=30,
               lr=1e-4,
               validate=True,
               schedule_type="warmup_cosine",
               optimizer_type="adamw")
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
0.00% [0/30 00:00<00:00]
0.00% [0/32 00:00<00:00]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-15-b8d5301e0d4e> in <module>
3 validate=True,
4 schedule_type="warmup_cosine",
----> 5 optimizer_type="adamw")
~/miniconda3/lib/python3.7/site-packages/fast_bert/learner_lm.py in fit(self, epochs, lr, validate, schedule_type, optimizer_type)
142 self.model.train()
143
--> 144 outputs = self.model(inputs, masked_lm_labels=labels)
145 loss = outputs[0] # model outputs are always tuple in pytorch-transformers (see doc)
146
~/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
545 result = self._slow_forward(*input, **kwargs)
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
549 hook_result = hook(self, input, result)
~/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
150 return self.module(*inputs[0], **kwargs[0])
151 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 152 outputs = self.parallel_apply(replicas, inputs, kwargs)
153 return self.gather(outputs, self.output_device)
154
~/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
160
161 def parallel_apply(self, replicas, inputs, kwargs):
--> 162 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
163
164 def gather(self, outputs, output_device):
~/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
83 output = results[i]
84 if isinstance(output, ExceptionWrapper):
---> 85 output.reraise()
86 outputs.append(output)
87 return outputs
~/miniconda3/lib/python3.7/site-packages/torch/_utils.py in reraise(self)
367 # (https://bugs.python.org/issue2651), so we work around it.
368 msg = KeyErrorMessage(msg)
--> 369 raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/apex-0.1-py3.7.egg/apex/amp/_initialize.py", line 197, in new_fwd
**applier(kwargs, input_caster))
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/transformers/modeling_roberta.py", line 231, in forward
inputs_embeds=inputs_embeds,
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/transformers/modeling_bert.py", line 727, in forward
input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/transformers/modeling_roberta.py", line 66, in forward
input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/transformers/modeling_bert.py", line 174, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1467, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/generic/THCTensorIndex.cu:397
"apex was installed without --cuda_ext" Try running this line: “在没有 --cuda_ext 的情况下安装了 apex”尝试运行这一行:
%pip install -v --no-cache-dir --global-option="--pyprof" --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
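For context, the ./apex path in that command assumes a local clone of NVIDIA's apex repository. A minimal sketch of the full from-source reinstall, run from a notebook (the clone URL is the official apex repo; rebuilding with the C++/CUDA extensions provides the fused multi_tensor_applier kernel, so the "Using Python fallback" warning should go away):

%pip uninstall -y apex
!git clone https://github.com/NVIDIA/apex
%pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex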
I'm using the same model elsewhere without problems.
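If rebuilding apex doesn't resolve it, it can also help to rule out the torch.nn.DataParallel replication step itself, since the traceback shows replica 1 on device 1 hitting a tensor that lives on another GPU. A minimal sketch, not from the original thread: multi_gpu=False is fast-bert's own parameter (the question passes multi_gpu=True), while restricting CUDA_VISIBLE_DEVICES is a generic PyTorch workaround.

import os
# Expose only one GPU so DataParallel is never engaged;
# this must run before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

lm_learner = BertLMLearner.from_pretrained_model(
    dataBunch=databunch_lm,
    pretrained_path='camembert-base',
    output_dir=MODEL_PATH,
    metrics=[],
    device=torch.device("cuda:0"),
    logger=logger,
    multi_gpu=False,   # skip the nn.DataParallel wrapping entirely
    logging_steps=50,
    fp16_opt_level="O2")

If training succeeds on a single GPU, the problem is in the multi-GPU replication path rather than in the model or the data.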