
How to combine two tokenized BERT sequences

Say I have two tokenized BERT sequences:

seq1 = tensor([[ 101,  2023,  2003,  1996, 23032,   102]])
seq2 = tensor([[ 101, 2023, 2003, 6019, 1015,  102]])

Each is produced with Hugging Face's tokenizer:

seq = torch.tensor(tokenizer.encode(text=query, add_special_tokens=True)).unsqueeze(0)

What is the best way to combine the tokenized sequences into one final sequence, where the [SEP] tokens are handled automatically?

For example:

combined = tensor([[ 101,  2023,  2003,  1996, 23032,   102,  2023,  2003,  6019,  1015,
           102]])

It seems like I could loop through and adjust the special tokens by hand, but that also seems hacky.

There are several options to achieve what you are looking for. For example, you could use the tokenizer's text_pair input if you can work with the strings directly. You can also concatenate the existing tensors with torch.cat. Please have a look at the example below:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

query1 = 'hello stackoverflow'
query2 = 'hello huggingface'
# creating an input pair from the original strings
print(tokenizer.encode(text=query1, text_pair=query2, return_tensors='pt'))

seq1 = tokenizer.encode(text=query1, return_tensors='pt')
seq2 = tokenizer.encode(text=query2, return_tensors='pt')
# concatenating the existing tensors; seq2[:, 1:] drops the leading [CLS]
# of the second sequence so only its trailing [SEP] remains
print(torch.cat((seq1, seq2[:, 1:]), dim=1))

Output:

tensor([[  101,  7592,  9991,  7840, 12314,   102,  7592, 17662, 12172,   102]])
tensor([[  101,  7592,  9991,  7840, 12314,   102,  7592, 17662, 12172,   102]])
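
One caveat worth adding: torch.cat only builds the combined input_ids. If you feed the pair to BERT itself, the model also expects token_type_ids that mark which segment each token belongs to, and the text_pair route produces those for you. A minimal sketch, assuming the same bert-base-uncased tokenizer as above:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Calling the tokenizer directly on (text, text_pair) returns input_ids,
# token_type_ids (0 for the first segment, 1 for the second) and an
# attention_mask, all ready to pass to the model.
encoded = tokenizer('hello stackoverflow', 'hello huggingface', return_tensors='pt')
print(encoded['input_ids'])
print(encoded['token_type_ids'])

Output:

tensor([[  101,  7592,  9991,  7840, 12314,   102,  7592, 17662, 12172,   102]])
tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])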
