
Load PreComputed Vectors Gensim

I am using the Gensim Python package to learn a neural language model, and I know that you can provide a training corpus to learn the model. However, many precomputed word vectors are already available in text format (e.g. http://www-nlp.stanford.edu/projects/glove/ ). Is there some way to initialize a Gensim Word2Vec model that just makes use of some precomputed vectors, rather than having to learn the vectors from scratch?

Thanks!

The GloVe dump from the Stanford site is in a format that differs only slightly from the word2vec format. You can convert the GloVe file into word2vec format using:

python -m gensim.scripts.glove2word2vec --input  glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt

You can download pre-trained word vectors from here (get the file 'GoogleNews-vectors-negative300.bin'): word2vec

Extract the file, and then you can load it in Python like:

model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)

model.most_similar('dog')

EDIT (May 2017): As the above code is now deprecated, this is how you'd load the vectors now:

model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)

As far as I know, Gensim can load two binary formats, word2vec and fastText, as well as a generic plain-text format that most word-embedding tools can produce. The generic plain-text format looks like this (in this example, 20000 is the size of the vocabulary and 100 is the length of each vector):

20000 100
the 0.476841 -0.620207 -0.002157 0.359706 -0.591816 [98 more numbers...]
and 0.223408  0.231993 -0.231131 -0.900311 -0.225111 [98 more numbers..]
[19998 more lines...]

Chaitanya Shivade has explained in his answer here how to use a script provided by Gensim to convert the GloVe format (each line: word + vector) into the generic format.

Loading the different formats is easy, but it is also easy to get them mixed up:

import gensim
model_file = 'path/to/model/file'

1) Loading binary word2vec

model = gensim.models.word2vec.Word2Vec.load_word2vec_format(model_file, binary=True)

2) Loading binary fastText

model = gensim.models.fasttext.FastText.load_fasttext_format(model_file)

3) Loading the generic plain text format (which was introduced by word2vec)

model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)

If you only plan to use the word embeddings and not to continue training them in Gensim, you may want to use the KeyedVectors class. This will considerably reduce the amount of memory needed to load the vectors ( detailed explanation ).

The following will load the binary word2vec format as KeyedVectors:

model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file, binary=True)
