
How to set vocabulary size in python tokenizers library?

I would like to train my own tokenizer and then use it with a pre-trained model, but when training a new tokenizer there seems to be no way to choose the vocabulary size. So when I call tokenizer.get_vocab() it always returns a dictionary with 30000 elements. How do I change that? Here is what I do:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(['transcripts.raw'], trainer)  # train() does not seem to accept an argument for the vocabulary size

What you can do is use the vocab_size parameter of the BpeTrainer, which defaults to 30000:

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=10)

For more information, you can check out the BpeTrainer docs.
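
Putting it together with the question's snippet, here is a minimal sketch (the target size of 5000 is only an illustration; the resulting vocabulary can come out smaller than vocab_size if the corpus does not produce enough merges):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is passed to the trainer, not to train()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=5000,
)
tokenizer.train(["transcripts.raw"], trainer)

# The learned vocabulary (including special tokens) is now capped at vocab_size
print(len(tokenizer.get_vocab()))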
