How do I solve the ValueError: only one element tensors can be converted to Python scalars?
I am closely following a tutorial on building a Question Answering Bot in PyTorch. However, during training, my code is unable to save the checkpoints and raises the aforementioned ValueError. The error happens at torch.save(torch.tensor(train_loss_set), os.path.join(output_dir, 'training_loss.pt'))
Below is my code corresponding to the train iterator:
num_train_epochs = 1
print("***** Running training *****")
print(" Num examples = %d" % len(dataset))
print(" Num Epochs = %d" % num_train_epochs)
print(" Batch size = %d" % batch_size)
print(" Total optimization steps = %d" % (len(train_dataloader) // num_train_epochs))
model.zero_grad()
train_iterator = trange(num_train_epochs, desc="Epoch")
set_seed()
for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
        if step < global_step + 1:
            continue
        model.train()
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'token_type_ids': batch[2],
                  'start_positions': batch[3],
                  'end_positions': batch[4]}
        outputs = model(**inputs)
        loss = outputs[0]
        train_loss_set.append(loss)
        loss.sum().backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        tr_loss += loss.sum().item()
        optimizer.step()
        model.zero_grad()
        global_step += 1
        if global_step % 1000 == 0:
            print("Train loss: {}".format(tr_loss/global_step))
            output_dir = 'checkpoints/checkpoint-{}'.format(global_step)
            if not os.path.exists(output_dir):
                os.makedirs(output_dir)
            model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
            model_to_save.save_pretrained(output_dir)
            torch.save(torch.tensor(train_loss_set), os.path.join(output_dir, 'training_loss.pt'))
            print("Saving model checkpoint to %s" % output_dir)
Edit: print(train_loss_set[:10]) returns the following:
[tensor([5.7099, 5.7395], device='cuda:0', grad_fn=<GatherBackward>), tensor([5.2470, 5.4016], device='cuda:0', grad_fn=<GatherBackward>), tensor([5.1311, 5.0390], device='cuda:0', grad_fn=<GatherBackward>), tensor([4.4326, 4.8475], device='cuda:0', grad_fn=<GatherBackward>), tensor([3.4740, 3.9955], device='cuda:0', grad_fn=<GatherBackward>), tensor([4.8710, 4.5907], device='cuda:0', grad_fn=<GatherBackward>), tensor([4.4294, 4.3013], device='cuda:0', grad_fn=<GatherBackward>), tensor([2.7536, 2.9540], device='cuda:0', grad_fn=<GatherBackward>), tensor([3.8989, 3.3436], device='cuda:0', grad_fn=<GatherBackward>), tensor([3.3534, 3.2532], device='cuda:0', grad_fn=<GatherBackward>)]
Could this have to do with the fact that I'm using DataParallel?
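This is a quirk of PyTorch: torch.tensor cannot build a tensor from a list of multi-element tensors. It tries to convert each list element to a Python scalar, which only works for one-element tensors. Your losses have two elements each, most likely because DataParallel gathers one loss per GPU (note the grad_fn=<GatherBackward> in your output), so DataParallel is indeed the likely reason.
There are three things you can do:
1. You don't need torch.tensor when saving a list of tensors, so this should work:
    torch.save(train_loss_set, os.path.join(output_dir, 'training_loss.pt'))
2. Use torch.stack instead:
    torch.save(torch.stack(train_loss_set), os.path.join(output_dir, 'training_loss.pt'))
3. Convert the tensors to ndarray as you append them. You can then keep using torch.tensor on the resulting list:
    train_loss_set.append(loss.cpu().detach().numpy())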
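The three options can be tried on the CPU with a minimal sketch like the one below. The loss values, the temporary directory standing in for output_dir, and the extra detach()/cpu() calls before saving are assumptions added for illustration; they are not part of the original training loop.

```python
import os
import tempfile

import numpy as np
import torch

# Made-up stand-ins for the per-step losses: two values per step,
# mimicking one loss per GPU under DataParallel.
train_loss_set = [
    torch.tensor([5.7099, 5.7395], requires_grad=True),
    torch.tensor([5.2470, 5.4016], requires_grad=True),
    torch.tensor([5.1311, 5.0390], requires_grad=True),
]

# The original call: expected to fail because each element holds two values.
try:
    torch.tensor(train_loss_set)
except (ValueError, TypeError, RuntimeError) as err:
    print("torch.tensor failed:", err)

# detach() returns tensors cut off from the autograd graph (an added
# precaution, not in the original code), so the saved losses carry no graph.
detached = [loss.detach().cpu() for loss in train_loss_set]

with tempfile.TemporaryDirectory() as output_dir:  # hypothetical checkpoint dir
    path = os.path.join(output_dir, "training_loss.pt")

    # Option 1: save the plain Python list of tensors.
    torch.save(detached, path)
    loaded_list = torch.load(path)

    # Option 2: stack into one tensor of shape (num_steps, values_per_step).
    stacked = torch.stack(detached)
    torch.save(stacked, path)

# Option 3: collect NumPy arrays, then build a single tensor from them.
losses_np = [loss.cpu().detach().numpy() for loss in train_loss_set]
from_numpy = torch.tensor(np.array(losses_np))

print(stacked.shape)     # torch.Size([3, 2])
print(from_numpy.shape)  # torch.Size([3, 2])
```

Options 2 and 3 give a single two-dimensional tensor that is easy to plot or average later, while option 1 keeps the losses as a list and needs no conversion at all.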