gensim word2vec: Find number of words in vocabulary

After training a word2vec model with Python's gensim, how do you find the number of words in the model's vocabulary?

The vocabulary is in the vocab field of the Word2Vec model's wv property, as a dictionary with each token (word) as a key. So it's just the usual Python way of getting a dictionary's length:

len(w2v_model.wv.vocab)

(In older gensim versions before 0.13, vocab appeared directly on the model, so you would use w2v_model.vocab instead of w2v_model.wv.vocab.)
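
As a minimal sketch (assuming gensim 3.x, where wv.vocab still exists, and a hypothetical toy corpus):

from gensim.models import Word2Vec

# toy corpus for illustration only
sentences = [["hello", "world"], ["hello", "gensim"]]

# gensim 3.x uses `size` for the embedding dimension (renamed `vector_size` in 4.0)
w2v_model = Word2Vec(sentences, min_count=1, size=50)

# number of unique tokens kept in the vocabulary
print(len(w2v_model.wv.vocab))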

Gojomo's answer above raises an AttributeError with Gensim 4.0.0+.

For these versions, you can get the length of the vocabulary as follows:

len(w2v_model.wv.index_to_key)

(which is slightly faster than len(w2v_model.wv.key_to_index))
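
As a minimal sketch (assuming gensim 4.0+ and a hypothetical toy corpus):

from gensim.models import Word2Vec

# toy corpus for illustration only
sentences = [["hello", "world"], ["hello", "gensim"]]

# gensim 4.0 renamed `size` to `vector_size`
model = Word2Vec(sentences, min_count=1, vector_size=50)

# both hold one entry per vocabulary word, so the lengths match
print(len(model.wv.index_to_key))   # list of tokens
print(len(model.wv.key_to_index))   # dict: token -> integer index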

One more way to get the vocabulary size is from the embedding matrix itself as in:

In [33]: from gensim.models import Word2Vec

# load the pretrained model
In [34]: model = Word2Vec.load(pretrained_model)

# get the shape of embedding matrix    
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)

# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109

Latest:

Use model.wv.key_to_index after creating the gensim model.

The vocab dict became key_to_index for looking up a key's integer index, while get_vecattr() and set_vecattr() handle other per-key attributes; see the migration guide: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes
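
As a minimal sketch of the 4.x replacements (hypothetical toy corpus; the "count" attribute is the per-key attribute described in the migration guide):

from gensim.models import Word2Vec

# toy corpus for illustration only
sentences = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(sentences, min_count=1, vector_size=50)

print(len(model.wv.key_to_index))             # vocabulary size
print(model.wv.key_to_index["hello"])         # integer index of a token
print(model.wv.get_vecattr("hello", "count")) # per-key attribute, e.g. the raw corpus count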
