
R package 'word2vec' doc2vec function

Original post: 2020-11-10 15:52:26 | Tags: r, word2vec, doc2vec

I am a student (computer science). This is my first question on Stack Overflow. I really would appreciate your help! (The package I am referring to is called 'word2vec', which is why choosing the tags and title was a bit confusing.)

In the description of the doc2vec function (here https://cran.r-project.org/web/packages/word2vec/word2vec.pdf ) it says:

Document vectors are the sum of the vectors of the words which are part of the document standardised by the scale of the vector space. This scale is the sqrt of the average inner product of the vector elements.
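As a rough illustration of what that description seems to say, here is a minimal Python sketch. It is not the package's code: the function name `sum_and_scale` and the toy vectors are hypothetical, and it assumes that "the sqrt of the average inner product of the vector elements" means the root mean square of the elements of the summed vector.

```python
import math

def sum_and_scale(word_vectors, tokens):
    """Sum the vectors of the known tokens, then divide by an assumed 'scale'.

    Assumption: the scale is sqrt(mean(x_i ** 2)) over the elements of the
    summed vector, which is one reading of the quoted description.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # no token of the document has a word vector
    dim = len(known[0])
    # Element-wise sum of the word vectors of the document.
    total = [sum(vec[i] for vec in known) for i in range(dim)]
    scale = math.sqrt(sum(x * x for x in total) / dim)
    if scale == 0.0:
        return None  # the word vectors cancelled out exactly
    return [x / scale for x in total]

# Hypothetical pre-trained word vectors; "the" is deliberately missing.
toy_vectors = {
    "cat": [1.0, 0.0, 2.0],
    "sat": [0.0, 1.0, 1.0],
    "mat": [1.0, 1.0, 0.0],
}
doc_vector = sum_and_scale(toy_vectors, "the cat sat".split())
```

Under this reading no per-document vector is trained: the result depends only on word vectors that already exist, and the scaling merely makes the mean squared element of the document vector equal to 1.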

From what I understood, doc2vec learns one additional vector for every paragraph, which, in my eyes, seems to be different from the above description.

Is my understanding of doc2vec correct, or close enough? And: does the cited implementation work like the doc2vec algorithm?

1 Answer

Many people use "Doc2Vec" to refer to the word2vec-like algorithm introduced in the paper Distributed Representations of Sentences and Documents (by Le & Mikolov). That paper calls the algorithm 'Paragraph Vector', without using the name 'Doc2Vec', and indeed introduces an extra vector per document, as you describe. (That is, the doc-vector is trained a bit like a 'floating' pseudoword-vector that contributes to the input 'context' for every training prediction in that document.)

I'm not familiar with R or that R word2vec package, but from the docs you linked, it does not sound like that doc2vec function implements the 'Paragraph Vector' algorithm that others call 'Doc2Vec'. In particular:

'Paragraph Vector' doc-vectors are not a simple sum-of-word-vectors.
'Paragraph Vector' doc-vectors are created by a separate word2vec-like training process that co-creates any necessary word-vectors simultaneously with that training. Specifically: that process does not normally take other pre-trained word-vectors as input, nor does it create word-vectors as a first step. (And further: the PV-DBOW option of the 'Paragraph Vector' paper doesn't create traditional word-vectors at all.)
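To make the contrast concrete, here is a toy Python sketch in the spirit of PV-DBOW. It is a simplification, not the paper's or any library's implementation: the function name `train_pv_dbow` and all hyperparameters are hypothetical, and it uses a full softmax over a tiny vocabulary, where practical implementations would use hierarchical softmax or negative sampling. The point it tries to show is that each document owns a trainable vector, learned by predicting that document's words, with no pre-trained word-vectors as input.

```python
import math
import random

def train_pv_dbow(docs, dim=8, epochs=200, lr=0.1, seed=0):
    """Toy PV-DBOW-style training: one trainable vector per document,
    optimised so that it predicts the words occurring in that document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_index = {w: i for i, w in enumerate(vocab)}
    # The "extra" vector per document, randomly initialised and then trained.
    doc_vecs = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in docs]
    # Output-layer weights, one row per vocabulary word. These are not
    # traditional input word-vectors and nothing here is pre-trained.
    out = [[0.0] * dim for _ in vocab]
    for _ in range(epochs):
        for d, doc in enumerate(docs):
            dv = doc_vecs[d]
            for word in doc:
                target = word_index[word]
                # Softmax over the vocabulary, conditioned on the doc vector.
                scores = [sum(o * x for o, x in zip(row, dv)) for row in out]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                z = sum(exps)
                grad_dv = [0.0] * dim
                for j, row in enumerate(out):
                    # Gradient of the cross-entropy loss w.r.t. score j.
                    err = exps[j] / z - (1.0 if j == target else 0.0)
                    for k in range(dim):
                        grad_dv[k] += err * row[k]   # uses the pre-update weight
                        row[k] -= lr * err * dv[k]   # SGD step on output weights
                for k in range(dim):
                    dv[k] -= lr * grad_dv[k]         # SGD step on the doc vector
    return doc_vecs, vocab

docs = [s.split() for s in ["cat sat mat", "cat sat rug", "stock market fell"]]
doc_vecs, vocab = train_pv_dbow(docs)
```

Unlike a sum of existing word vectors, the document vectors here only come into being through training, and a new document would need its own inference pass to obtain a vector.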

It appears that the function is poorly named; if you need the actual 'Paragraph Vector' algorithm, you will need to look elsewhere.