
How to set vocabulary size in python tokenizers library?

I would like to train my own tokenizer and then use it with a pre-trained model, but when training a new tokenizer there seems to be no way to choose the vocabulary size. So when I call tokenizer.get_vocab() it always returns a dictionary with 30000 elements. How do I change that? Here is what I do:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(['transcripts.raw'], trainer)  # train() does not seem to accept an argument for the vocabulary size

What you can do is use the vocab_size parameter of the BpeTrainer, which defaults to 30000:

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=10)

For more information, you can check out the BpeTrainer docs.
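
Putting it together with the question's snippet, here is a minimal sketch (the target size of 5000 is only an illustration; the resulting vocabulary can come out smaller than vocab_size if the corpus does not produce enough merges):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is passed to the trainer, not to train()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=5000,
)
tokenizer.train(["transcripts.raw"], trainer)

# The learned vocabulary (including special tokens) is now capped at vocab_size
print(len(tokenizer.get_vocab()))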
