
gensim word2vec: Find number of words in vocabulary

After training a word2vec model using Python gensim, how do you find the number of words in the model's vocabulary?

The vocabulary is in the vocab field of the Word2Vec model's wv property, as a dictionary, with the keys being each token (word). So it's just the usual Python for getting a dictionary's length:

len(w2v_model.wv.vocab)

(In older gensim versions before 0.13, vocab appeared directly on the model. So you would use w2v_model.vocab instead of w2v_model.wv.vocab.)

Gojomo's answer raises an AttributeError for Gensim 4.0.0+.

For these versions, you can get the length of the vocabulary as follows:

len(w2v_model.wv.index_to_key)

(which is slightly faster than len(w2v_model.wv.key_to_index).)
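If the same code has to run against several gensim versions, the attribute names mentioned in the answers above can be wrapped in one small helper. This is only a sketch: vocab_size is a hypothetical name, not part of gensim, and the SimpleNamespace objects are stand-ins used so the snippet runs without a trained model.

```python
from types import SimpleNamespace


def vocab_size(model):
    """Return the vocabulary size of a gensim Word2Vec-like model.

    Hypothetical helper (not a gensim API). It only assumes the attribute
    names discussed above: wv.key_to_index (4.0+), wv.vocab (before 4.0),
    or vocab directly on the model (very old releases).
    """
    # Very old releases have no wv property, so fall back to the model itself.
    wv = getattr(model, "wv", model)

    # Gensim 4.0+: key_to_index maps each word to its integer index.
    key_to_index = getattr(wv, "key_to_index", None)
    if key_to_index is not None:
        return len(key_to_index)

    # Before 4.0: vocab is a dict keyed by word.
    return len(wv.vocab)


# Stand-in objects that mimic the three attribute layouts.
new_style = SimpleNamespace(wv=SimpleNamespace(key_to_index={"cat": 0, "dog": 1}))
old_style = SimpleNamespace(wv=SimpleNamespace(vocab={"cat": 1, "dog": 2, "fish": 3}))
oldest_style = SimpleNamespace(vocab={"cat": 1})

print(vocab_size(new_style), vocab_size(old_style), vocab_size(oldest_style))
```

With a real model you would call vocab_size(w2v_model) instead of passing the stand-ins.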

One more way to get the vocabulary size is from the embedding matrix itself, as in:

In [33]: from gensim.models import Word2Vec

# load the pretrained model
In [34]: model = Word2Vec.load(pretrained_model)

# get the shape of embedding matrix    
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)

# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109

Latest:

Use model.wv.key_to_index after creating the gensim model.

The vocab dict became key_to_index for looking up a key's integer index, or get_vecattr() and set_vecattr() for other per-key attributes: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes
