After training a word2vec model with Python gensim, how do you find the number of words in the model's vocabulary?
The vocabulary is in the vocab field of the Word2Vec model's wv property, stored as a dictionary whose keys are the tokens (words). So it's just the usual Python way of getting a dictionary's length:
len(w2v_model.wv.vocab)
(In gensim versions before 0.13, vocab appeared directly on the model, so you would use w2v_model.vocab instead of w2v_model.wv.vocab.)
Gojomo's answer raises an AttributeError for Gensim 4.0.0+.
For these versions, you can get the length of the vocabulary as follows:
len(w2v_model.wv.index_to_key)
(which is slightly faster than len(w2v_model.wv.key_to_index))
One more way to get the vocabulary size is from the shape of the embedding matrix itself:
In [33]: from gensim.models import Word2Vec
# load the pretrained model
In [34]: model = Word2Vec.load(pretrained_model)
# get the shape of embedding matrix
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)
# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109
Latest:
After creating the gensim model, use model.wv.key_to_index.
The vocab dict became key_to_index for looking up a key's integer index, and get_vecattr() and set_vecattr() handle other per-key attributes: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes