What are the specific steps for computing sentence vectors from word2vec word vectors using the averaging method?

Beginner question, but I am a bit puzzled by this. I hope the answer can benefit other beginners in NLP as well.

Here are some more details:

I know that you can compute sentence vectors from word vectors generated by word2vec. But what are the actual steps involved in making these sentence vectors? Can anyone provide an intuitive example and then some calculations to explain this process?

e.g.: Suppose I have a sentence with three words: "Today is hot." And suppose these words have hypothetical vector values of (1,2,3), (4,5,6), (7,8,9). Do I get the sentence vector by performing component-wise averaging of these word vectors? And what if the vectors are of different lengths, e.g. (1,2), (4,5,6), (7,8,9,23,76)? What does the averaging process look like in those cases?

Creating the vector for a length-of-text (sentence/paragraph/document) by averaging the word-vectors is one simple approach. (It's not great at capturing shades-of-meaning, but it's easy to do.)

Using the gensim library, it can be as simple as:

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

# Load the pretrained Google News vectors (300 dimensions per word).
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
text = "the quick brown fox jumped over the lazy dog"
# The text vector is the component-wise mean of the word vectors.
text_vector = np.mean([wv[word] for word in text.split()], axis=0)
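One caveat worth noting: the snippet above raises a KeyError if any token is missing from the model's vocabulary, so in practice you may want to drop out-of-vocabulary words first. A minimal sketch of that guard:

# Keep only tokens the model knows, so an unknown word doesn't raise.
known = [word for word in text.split() if word in wv]
text_vector = np.mean([wv[word] for word in known], axis=0)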

Alternatives worth considering are whether to use the raw word-vectors, word-vectors that are first unit-normalized, or word-vectors weighted by some measure of word significance.
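A sketch of what those two variants could look like; `wv` is the KeyedVectors object loaded earlier, and `word_weights` is a hypothetical dict of per-word significance scores (e.g. TF-IDF values) that you would supply yourself:

import numpy as np

def average_unit_normalized(words, wv):
    # Normalize each word vector to unit length before averaging,
    # so frequent words with large-magnitude vectors don't dominate.
    vecs = [wv[w] / np.linalg.norm(wv[w]) for w in words if w in wv]
    return np.mean(vecs, axis=0)

def weighted_average(words, wv, word_weights):
    # Scale each word vector by its significance score (default 1.0).
    vecs = [wv[w] * word_weights.get(w, 1.0) for w in words if w in wv]
    return np.mean(vecs, axis=0)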

Word-vectors that are compatible with each other will have the same number of dimensions, so there's never an issue of trying to average differently-sized vectors.

Other techniques like 'Paragraph Vectors' (Doc2Vec in gensim) might give better text-vectors for some purposes, on some corpora.
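For reference, a minimal Doc2Vec sketch, assuming gensim 4.x; the two-document corpus here is made up purely for illustration:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training text is wrapped in a TaggedDocument with a unique tag.
corpus = [TaggedDocument(words=["today", "is", "hot"], tags=[0]),
          TaggedDocument(words=["the", "quick", "brown", "fox"], tags=[1])]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new (or training) text after training.
vec = model.infer_vector(["today", "is", "hot"])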

Other techniques for comparing the similarity of texts that leverage word-vectors, like "Word Mover's Distance" (WMD), might give better pairwise text-similarity scores than comparing single summary vectors. (WMD doesn't reduce a text to a single vector, and can be expensive to calculate.)
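gensim exposes WMD on KeyedVectors as wmdistance; note it needs an extra optimal-transport dependency installed (pyemd in older gensim releases, POT in newer ones), and it returns a distance, so lower means more similar:

# Pairwise similarity via Word Mover's Distance on tokenized texts.
distance = wv.wmdistance("today is hot".split(),
                         "it is warm outside".split())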

For your example, averaging the 3 word vectors (each of 3 dimensions) would yield a single vector of 3 dimensions.

Centroid-vec = 1/3 * (1+4+7, 2+5+8, 3+6+9) = (4, 5, 6)
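You can verify this component-wise average directly with numpy:

import numpy as np

# The three hypothetical 3-dimensional word vectors from the question.
word_vectors = np.array([[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9]])

# Component-wise average: mean over the rows (axis=0).
sentence_vector = word_vectors.mean(axis=0)
print(sentence_vector)  # [4. 5. 6.]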

A better way to get a single vector for a document is to use paragraph vectors, commonly known as doc2vec.
