
How to combine two tokenized BERT sequences

Say I have two tokenized BERT sequences:

seq1 = tensor([[ 101,  2023,  2003,  1996, 23032,   102]])
seq2 = tensor([[ 101, 2023, 2003, 6019, 1015,  102]])

Each is produced with Hugging Face's tokenizer:

seq = torch.tensor(tokenizer.encode(text=query, add_special_tokens=True)).unsqueeze(0)

What is the best way to combine the tokenized sequences into one final sequence, where the [SEP] tokens are handled automatically?

For example:

combined = tensor([[ 101,  2023,  2003,  1996, 23032,   102,  2023,  2003,  6019,  1015,
           102]])

It seems like I could loop through and adjust the special tokens by hand, but that also seems hacky.

There are several options to achieve what you are looking for. For example, you could use the tokenizer's text_pair input if you can work with the strings directly. You can also concatenate the existing tensors with torch.cat. Please have a look at the example below:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

query1 = 'hello stackoverflow'
query2 = 'hello huggingface'
# creating an input pair from the original strings
print(tokenizer.encode(text=query1, text_pair=query2, return_tensors='pt'))

seq1 = tokenizer.encode(text=query1, return_tensors='pt')
seq2 = tokenizer.encode(text=query2, return_tensors='pt')
# concatenating the existing tensors; seq2[:, 1:] drops the leading [CLS]
# of the second sequence so only its trailing [SEP] remains
print(torch.cat((seq1, seq2[:, 1:]), dim=1))

Output:

tensor([[  101,  7592,  9991,  7840, 12314,   102,  7592, 17662, 12172,   102]])
tensor([[  101,  7592,  9991,  7840, 12314,   102,  7592, 17662, 12172,   102]])
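
One caveat worth adding: torch.cat only builds the combined input_ids. If you feed the pair to BERT itself, the model also expects token_type_ids that mark which segment each token belongs to, and the text_pair route produces those for you. A minimal sketch, assuming the same bert-base-uncased tokenizer as above:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Calling the tokenizer directly on (text, text_pair) returns input_ids,
# token_type_ids (0 for the first segment, 1 for the second) and an
# attention_mask, all ready to pass to the model.
encoded = tokenizer('hello stackoverflow', 'hello huggingface', return_tensors='pt')
print(encoded['input_ids'])
print(encoded['token_type_ids'])

Output:

tensor([[  101,  7592,  9991,  7840, 12314,   102,  7592, 17662, 12172,   102]])
tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])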
