
gensim word2vec accessing in/out vectors

In the word2vec model, there are two linear transforms that take a word from vocab space to the hidden layer (the "in" vector), and then back to vocab space (the "out" vector). Usually this out vector is discarded after training. I'm wondering if there's an easy way of accessing the out vector in gensim's Python API? Equivalently, how can I access the out matrix?

Motivation: I would like to implement the ideas presented in this recent paper: A Dual Embedding Space Model for Document Ranking.

Here are more details. From the reference above we have the following word2vec model:

[figure: the word2vec network architecture, with input layer, hidden layer, and output layer connected by the matrices $W_{IN}$ and $W_{OUT}$]

Here, the input layer is of size $V$ (the vocabulary size), the hidden layer is of size $d$, and the output layer is of size $V$. The two matrices are $W_{IN}$ and $W_{OUT}$. Usually, the word2vec model keeps only the $W_{IN}$ matrix. This is what is returned when, after training a word2vec model in gensim, you get stuff like:

model['potato'] = [-0.2, 0.5, 2, ...]
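For concreteness, here is a minimal sketch (assuming a trained gensim model named model, with gensim-1.x-style attribute names) showing that this lookup is just a row of the $W_{IN}$ matrix:

import numpy as np

# model['potato'] is simply the row of W_IN (model.wv.syn0) for that word.
idx = model.wv.vocab['potato'].index
assert np.allclose(model.wv.syn0[idx], model['potato'])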

How can I access, or retain, $W_{OUT}$? This is likely quite computationally expensive, and I'm really hoping for a built-in method in gensim to do this, because I'm afraid that if I coded it from scratch it would not perform well.
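(For reference, the paper's IN-OUT similarity is just a cosine between a row of $W_{IN}$ and a row of $W_{OUT}$; a hypothetical sketch, assuming the two matrices are available as NumPy arrays w_in and w_out:)

import numpy as np

def in_out_sim(w_in, w_out, q_idx, d_idx):
    # Cosine between the IN vector of a query word and the OUT vector
    # of a document word, as in the DESM scoring.
    u, v = w_in[q_idx], w_out[d_idx]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))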

While this might not be a proper answer (I can't comment yet), and no one has pointed this out, take a look here. The creator seems to answer a similar question, and that's also the place where you have a higher chance of getting a valid answer.

Digging around in the link he posted to the word2vec source code, you could change the syn1 deletion to suit your needs. Just remember to delete it after you're done, since it proves to be a memory hog.
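As a concrete illustration, here is a minimal sketch (assuming an older gensim where the training matrices live directly on the model; in gensim 3.x they moved under model.trainables) of keeping the "out" matrix around after training:

from gensim.models import Word2Vec

sentences = [['the', 'potato', 'is', 'a', 'vegetable'],
             ['i', 'like', 'potato', 'salad']]

# With negative sampling (the default) the "out" matrix is syn1neg;
# with hierarchical softmax (hs=1) it would be syn1 instead.
model = Word2Vec(sentences, size=10, negative=5, min_count=1)

w_in = model.syn0      # W_IN: one row per vocab word
w_out = model.syn1neg  # W_OUT: same shape, normally deleted to save memory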

The code below will save/load a model. It uses pickle internally, optionally mmap'ing the model's large internal NumPy matrices into virtual memory directly from the disk files, for inter-process memory sharing.

model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
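If you want the mmap behavior mentioned above, pass the mmap keyword when loading (a sketch; mmap is an argument of gensim's generic load):

new_model = gensim.models.Word2Vec.load('/tmp/mymodel', mmap='r')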

Some background information: Gensim is a free Python library designed to process raw, unstructured digital texts ("plain text"). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections, discover the semantic structure of documents by examining statistical co-occurrence patterns of the words within a corpus of training documents.

Some good blogs describe usage and provide sample code to kick-start the project.

Installation reference here.

Hope this helps!

In the word2vec.py file you need to make the following change. The function below currently returns the "in" vector, but you want the "out" vector: the "in" vectors are saved in the syn0 object and the "out" vectors in the syn1neg object variable.

def save_word2vec_format(self, fname, fvocab=None, binary=False):
  ....
  ....
  row = self.syn1neg[vocab.index]  # was self.syn0[vocab.index]; emit the "out" vector instead
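Rather than patching the file, the same indexing can be applied directly to a model in memory (a sketch, assuming the model was trained with negative sampling so syn1neg exists and has not been deleted):

word = model.wv.vocab['potato']
in_vec = model.wv.syn0[word.index]   # "in" vector, same as model['potato']
out_vec = model.syn1neg[word.index]  # "out" vector for the same word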

To get the syn1 of any word, this might work:

model.syn1[model.wv.vocab['potato'].point]

where model is your trained word2vec model.
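Note that vocab['potato'].point is an array of Huffman-tree node indices used by hierarchical softmax, so the expression above returns several rows of syn1 (the inner-node vectors along the word's tree path) rather than a single per-word vector. With negative sampling, the per-word "out" vector is indexed by .index instead (a sketch):

word = model.wv.vocab['potato']
path_vecs = model.syn1[word.point]   # hs=1: inner-node vectors along the tree path
out_vec = model.syn1neg[word.index]  # negative sampling: one "out" vector per word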
