gensim word2vec: Find number of words in vocabulary

After training a word2vec model with Python's gensim, how do you find the number of words in the model's vocabulary?

The vocabulary is in the vocab field of the Word2Vec model's wv property, as a dictionary with each token (word) as a key. So it's just the usual Python way of getting a dictionary's length:

len(w2v_model.wv.vocab)

(In older gensim versions before 0.13, vocab appeared directly on the model, so you would use w2v_model.vocab instead of w2v_model.wv.vocab.)
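
As a minimal sketch (assuming gensim 3.x, where wv.vocab still exists, and a hypothetical toy corpus):

from gensim.models import Word2Vec

# toy corpus for illustration only
sentences = [["hello", "world"], ["hello", "gensim"]]

# gensim 3.x uses `size` for the embedding dimension (renamed `vector_size` in 4.0)
w2v_model = Word2Vec(sentences, min_count=1, size=50)

# number of unique tokens kept in the vocabulary
print(len(w2v_model.wv.vocab))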

Gojomo's answer above raises an AttributeError with Gensim 4.0.0+.

For these versions, you can get the length of the vocabulary as follows:

len(w2v_model.wv.index_to_key)

(which is slightly faster than len(w2v_model.wv.key_to_index))
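
As a minimal sketch (assuming gensim 4.0+ and a hypothetical toy corpus):

from gensim.models import Word2Vec

# toy corpus for illustration only
sentences = [["hello", "world"], ["hello", "gensim"]]

# gensim 4.0 renamed `size` to `vector_size`
model = Word2Vec(sentences, min_count=1, vector_size=50)

# both hold one entry per vocabulary word, so the lengths match
print(len(model.wv.index_to_key))   # list of tokens
print(len(model.wv.key_to_index))   # dict: token -> integer index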

One more way to get the vocabulary size is from the embedding matrix itself as in:

In [33]: from gensim.models import Word2Vec

# load the pretrained model
In [34]: model = Word2Vec.load(pretrained_model)

# get the shape of embedding matrix    
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)

# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109

Latest:

Use model.wv.key_to_index after creating the gensim model.

The vocab dict became key_to_index for looking up a key's integer index, while get_vecattr() and set_vecattr() handle other per-key attributes; see the migration guide: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes
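
As a minimal sketch of the 4.x replacements (hypothetical toy corpus; the "count" attribute is the per-key attribute described in the migration guide):

from gensim.models import Word2Vec

# toy corpus for illustration only
sentences = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(sentences, min_count=1, vector_size=50)

print(len(model.wv.key_to_index))             # vocabulary size
print(model.wv.key_to_index["hello"])         # integer index of a token
print(model.wv.get_vecattr("hello", "count")) # per-key attribute, e.g. the raw corpus count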
