Loss Function in Multi-GPU Training (PyTorch)
I use PyTorch and BERT to train a model. Everything works great on one GPU, but when I try to use multiple GPUs I get an error:
ValueError Traceback (most recent call last)
<ipython-input-168-507223f9879c> in <module>()
92 # single value; the `.item()` function just returns the Python value
93 # from the tensor.
---> 94 total_loss += loss.item()
95
96 # Perform a backward pass to calculate the gradients.
ValueError: only one element tensors can be converted to Python scalars
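For context, .item() only works on tensors that hold exactly one element; a small standalone snippet (plain tensors standing in for the gathered loss) reproduces the error:

import torch

# A zero-dimensional (single-element) tensor converts fine:
single = torch.tensor(0.5)
print(single.item())   # 0.5

# A tensor with several elements raises the same ValueError as above:
multi = torch.tensor([0.5, 0.6, 0.4, 0.7])
print(multi.item())    # ValueError: only one element tensors can be converted to Python scalars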
Can someone tell me what I am missing and how I should fix it?
Here is my code for training:
import random
import time

import numpy as np
import torch

# Fix the random seeds for reproducibility.
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

loss_values = []

for epoch_i in range(0, epochs):
    t0 = time.time()
    total_loss = 0

    for step, batch in enumerate(train_dataloader):
        # Report elapsed time every 40 batches.
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)

        # Move the batch tensors to the GPU.
        b_input_ids = batch[0].to(device).long()
        b_input_mask = batch[1].to(device).long()
        b_labels = batch[2].to(device).long()

        model.zero_grad()

        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)

        loss = outputs[0]
        total_loss += loss.item()

        # Backward pass with gradient clipping.
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(train_dataloader)
    loss_values.append(avg_train_loss)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
And here is my code for the model:
import torch.nn as nn
from transformers import BertForSequenceClassification, AdamW, BertConfig

model_to_parallel = BertForSequenceClassification.from_pretrained(
    "./bert_cache.zip",
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,
)

# Replicate the model across four GPUs; each gets a slice of the batch.
model = nn.DataParallel(model_to_parallel, device_ids=[0, 1, 2, 3])
model.to(device)
After loss = outputs[0], the loss is a multi-element tensor whose size is the number of GPUs: nn.DataParallel runs a replica of the model on each device and gathers the per-replica losses into a single tensor, which .item() cannot convert to a scalar. You can use loss = outputs[0].mean() instead.
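For example, only the loss line in the training loop needs to change; a minimal sketch of the affected lines (everything else stays as in the question):

outputs = model(b_input_ids,
                token_type_ids=None,
                attention_mask=b_input_mask,
                labels=b_labels)

# nn.DataParallel returns one loss value per GPU; average them into a
# single scalar before calling .item() and .backward().
loss = outputs[0].mean()
total_loss += loss.item()
loss.backward()

Averaging rather than summing keeps the reported loss on the same scale as the single-GPU run, since each replica already averages over its share of the batch.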