
BERT get sentence embedding

I am following the code from this page. I have downloaded the BERT model to my local system and am generating sentence embeddings with it.

I have around 500,000 sentences for which I need sentence embeddings, and this is taking a lot of time.

  1. Is there a way to speed this up?
  2. Would it help to send batches of sentences rather than one sentence at a time?


#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpus = ["i am a boy", "i live in a city"]



storage = []  # list to store all embeddings

for text in corpus:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"

    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)

    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = [1] * len(tokenized_text)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on 
        # how it's configured in the `from_pretrained` call earlier. In this case, 
        # because we set `output_hidden_states = True`, the third item will be the 
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` is a tuple of 13 layers, each of shape [1 x seq_len x 768]

    # `token_vecs` is a tensor with shape [seq_len x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    storage.append((text, sentence_embedding))

UPDATE 1

I modified my code based on the answer provided. It does not do full batching yet.

#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)


storage = []  # list to store all embeddings
for i, text in enumerate(batch_sentences):

    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print(tokens_tensor)
    print(segments_tensors)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on 
        # how it's configured in the `from_pretrained` call earlier. In this case, 
        # because we set `output_hidden_states = True`, the third item will be the 
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` is a tuple of 13 layers, each of shape [1 x seq_len x 768]

    # `token_vecs` is a tensor with shape [seq_len x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print(sentence_embedding[:10])
    storage.append((text, sentence_embedding))

I can replace the first 2 lines inside the for loop with the two lines below, but they only work if all sentences have the same length after tokenization:

tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])

Also, in that case, outputs = model(tokens_tensor, segments_tensors) fails, presumably because the extra brackets produce a tensor of shape [1 x batch x seq_len] rather than the [batch x seq_len] the model expects.

How can I perform full batching in this case?

One of the easiest ways to speed up your workflow is to process the data in batches. In the current implementation, you feed only one sentence per iteration, but batched data can be used!

Now, if you are willing to implement this part yourself, I strongly recommend using the tokenizer in this way to prepare your data:

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}
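
To feed several sentences to the model as a single tensor, they also have to be padded to a common length, which is what makes batching work for sentences of different tokenized lengths. A minimal sketch of the padded variant, using the tokenizer's standard padding=True and return_tensors='pt' arguments:

# Pad all sentences in the batch to the longest one and return PyTorch tensors
encoded_inputs = tokenizer(batch_sentences, padding=True, return_tensors='pt')

# input_ids and attention_mask now both have shape [batch_size, max_len],
# so the whole batch can go through the model in one call
with torch.no_grad():
    outputs = model(**encoded_inputs)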

But there is an even simpler approach: using FeatureExtractionPipeline, which comes with comprehensive documentation. It looks like this:

from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction(["Hello I'm a single sentence",
                               "And another sentence",
                               "And the very very last one"])

UPDATE: Actually, you changed your code slightly, but you are still passing one sample at a time rather than a batch. If we want to stick with your implementation, batching would look like this:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.eval()
sentences = [ 
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
            ]
batch_size = 4  
for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx+batch_size)]
    
    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch, max_length=50, padding='max_length', truncation=True)

    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}
    with torch.no_grad():
        outputs = model(**encoded)

    print(outputs.last_hidden_state.size())

Output:

torch.Size([4, 50, 768]) # batch_size * max_length * hidden dim
torch.Size([4, 50, 768])
torch.Size([1, 50, 768]) 
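
To recover one vector per sentence from these batched outputs, as in the question, the averaging has to ignore the padded positions, otherwise the [PAD] tokens dilute the mean. A sketch of masked mean pooling, continuing the loop body above:

# Mean-pool each sequence over its real tokens only, using the attention mask
mask = encoded['attention_mask'].unsqueeze(-1).float()    # [batch, 50, 1]
summed = (outputs.last_hidden_state * mask).sum(dim=1)    # [batch, 768]
counts = mask.sum(dim=1)                                  # [batch, 1]
sentence_embeddings = summed / counts                     # one 768-dim vector per sentence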

Regarding your original question: there is not much you can do about it. BERT is a computationally demanding algorithm. Your best bet is to use BertTokenizerFast instead of the regular BertTokenizer. The "fast" version is much more efficient, and you will see the difference on large amounts of text.
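
For example, a minimal sketch (BertTokenizerFast is a drop-in replacement with the same from_pretrained interface):

from transformers import BertTokenizerFast

# The Rust-backed "fast" tokenizer is considerably quicker on large corpora
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
encoded = tokenizer(["i am a boy", "i live in a city"],
                    padding=True, truncation=True, return_tensors='pt')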

That said, I have to warn you that averaging BERT word embeddings does not create good sentence embeddings. See this post. Given your problem, I assume you want to do some kind of semantic similarity search. Try using one of the open-sourced models instead.
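
For instance, a sketch using the sentence-transformers library, assuming that is the kind of open-sourced model meant (the model name 'all-MiniLM-L6-v2' is one illustrative choice, not something from the original answer):

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative model choice
sentences = ["i am a boy", "i live in a city"]

# encode() handles batching internally and returns one embedding per sentence
embeddings = model.encode(sentences, batch_size=64)
print(embeddings.shape)  # (2, 384) for this particular model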
