
The expanded size of the tensor (1011) must match the existing size (512) at non-singleton dimension 1

I trained a LayoutLMv2 model from Hugging Face, and when I try to run inference on a single image it raises a runtime error. The code is as follows:

query = '/Users/vaihabsaxena/Desktop/Newfolder/labeled/Others/Two.pdf26.png'
image = Image.open(query).convert("RGB")
encoded_inputs = processor(image, return_tensors="pt").to(device)
outputs = model(**encoded_inputs)
preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
pred_labels = {label:pred for label, pred in zip(label2idx.keys(), preds)}
pred_labels

The error occurs when I call model(**encoded_inputs). The processor is loaded from the Hugging Face hub and initialized together with the other APIs as follows:

feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

The model is defined and trained as follows:

model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",  num_labels=len(label2idx)
)
model.to(device);


optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3


for epoch in range(num_epochs):
    print("Epoch:", epoch)
    training_loss = 0.0
    training_correct = 0
    #put the model in training mode
    model.train()
    for batch in tqdm(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss

        training_loss += loss.item()
        predictions = outputs.logits.argmax(-1)
        training_correct += (predictions == batch['labels']).float().sum()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print("Training Loss:", training_loss / batch["input_ids"].shape[0])
    training_accuracy = 100 * training_correct / len(train_data)
    print("Training accuracy:", training_accuracy.item())  
        
    validation_loss = 0.0
    validation_correct = 0
    for batch in tqdm(valid_dataloader):
        outputs = model(**batch)
        loss = outputs.loss

        validation_loss += loss.item()
        predictions = outputs.logits.argmax(-1)
        validation_correct += (predictions == batch['labels']).float().sum()

    print("Validation Loss:", validation_loss / batch["input_ids"].shape[0])
    validation_accuracy = 100 * validation_correct / len(valid_data)
    print("Validation accuracy:", validation_accuracy.item())

Full error traceback:

RuntimeError                              Traceback (most recent call last)
/Users/vaihabsaxena/Desktop/Newfolder/pytorch.ipynb Cell 37 in <cell line: 4>()
      2 image = Image.open(query).convert("RGB")
      3 encoded_inputs = processor(image, return_tensors="pt").to(device)
----> 4 outputs = model(**encoded_inputs)
      5 preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
      6 pred_labels = {label:pred for label, pred in zip(label2idx.keys(), preds)}

File ~/opt/anaconda3/envs/env_pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/opt/anaconda3/envs/env_pytorch/lib/python3.9/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py:1071, in LayoutLMv2ForSequenceClassification.forward(self, input_ids, bbox, image, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
   1061 visual_position_ids = torch.arange(0, visual_shape[1], dtype=torch.long, device=device).repeat(
   1062     input_shape[0], 1
   1063 )
   1065 initial_image_embeddings = self.layoutlmv2._calc_img_embeddings(
   1066     image=image,
   1067     bbox=visual_bbox,
...
    896     input_shape[0], 1
    897 )
    898 final_position_ids = torch.cat([position_ids, visual_position_ids], dim=1)

RuntimeError: The expanded size of the tensor (1011) must match the existing size (512) at non-singleton dimension 1.  Target sizes: [1, 1011].  Tensor sizes: [1, 512]

I tried to configure the tokenizer to cut the text off at the maximum length, but the encoded_inputs produced by the processor from the image are still too long. What is going wrong here?
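
As a quick sanity check (a sketch added here, not part of the original question; the image path is a placeholder), you can compare the sequence length the processor produces against the model's position-embedding limit:

from PIL import Image
from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor, LayoutLMv2ForSequenceClassification

# Hypothetical image path; substitute your own file
query = "/path/to/your/image.png"
image = Image.open(query).convert("RGB")

feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
model = LayoutLMv2ForSequenceClassification.from_pretrained("microsoft/layoutlmv2-base-uncased", num_labels=2)

encoded_inputs = processor(image, return_tensors="pt")
seq_len = encoded_inputs["input_ids"].shape[1]    # number of OCR'd tokens, e.g. 1011 here
max_len = model.config.max_position_embeddings    # 512 for the base checkpoint
print(seq_len, max_len)  # forward() fails whenever seq_len > max_len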

The error message tells you that the text extracted via OCR (1011 tokens) is longer than what the underlying model can handle (512 tokens). Depending on your task, you may be able to truncate your text with the tokenizer argument truncation (the processor passes this argument on to the tokenizer):

import torch
from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor, LayoutLMv2ForSequenceClassification
from PIL import Image, ImageDraw, ImageFont

query = "/content/Screenshot_20220905_202551.png"
image = Image.open(query).convert("RGB")

feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
model = LayoutLMv2ForSequenceClassification.from_pretrained("microsoft/layoutlmv2-base-uncased",  num_labels=2)

encoded_inputs = processor(image, return_tensors="pt")
# The model will raise an error because this sequence is longer than the trained position embeddings
print(encoded_inputs["input_ids"].shape)
encoded_inputs = processor(image, return_tensors="pt", truncation=True)
print(encoded_inputs["input_ids"].shape)
outputs = model(**encoded_inputs)
preds = torch.softmax(outputs.logits, dim=1).tolist()[0]

Output:

torch.Size([1, 644])
torch.Size([1, 512])

For this code I used the following screenshot: a page from the Donut paper.
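
A related note (my own addition, not from the original answer): if you fine-tune with truncated inputs, it is worth passing the same arguments when encoding the training data, so the sequence lengths seen at training and inference match. A minimal sketch, assuming lists of PIL images and integer labels called train_images / train_labels (hypothetical names), and the processor and model defined above:

import torch

def encode_batch(images, labels):
    # truncation / padding / max_length are forwarded by the processor to the tokenizer,
    # so every example is capped at the 512 positions the model supports
    enc = processor(images, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
    enc["labels"] = torch.tensor(labels)
    return enc

# Hypothetical usage: train_images / train_labels are assumed to exist
# batch = encode_batch(train_images[:4], train_labels[:4])
# outputs = model(**batch)  # input_ids is at most [batch_size, 512], so no size mismatch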
