
How to use pretrained word2vec vectors in doc2vec model?

I am trying to implement doc2vec, but I am not sure what the input to the model should look like if I have pretrained word2vec vectors.

The problem is that I am not sure how, in theory, to use pretrained word2vec vectors for doc2vec. I imagine that I could pre-fill the hidden layer with the vectors and fill the rest of the hidden layer with random numbers.

Another idea is to use the vectors as the input for each word instead of a one-hot encoding, but I am not sure whether the output vectors for the documents would make sense.

Thank you for your answer!

You might think that Doc2Vec (aka the 'Paragraph Vector' algorithm of Mikolov/Le) requires word-vectors as a 1st step. That's a common belief, and perhaps somewhat intuitive, by analogy to how humans learn a new language: understand the smaller units before the larger, then compose the meaning of the larger from the smaller.

But that's a common misconception, and Doc2Vec doesn't do that.

One mode, pure PV-DBOW (dm=0 in gensim), doesn't use conventional per-word input vectors at all. And this mode is often one of the fastest-training and best-performing options.
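
For concreteness, here is a minimal gensim sketch of that pure PV-DBOW mode (assuming gensim 4.x; the tiny corpus, tags, and parameter values are made up for illustration):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document is a list of tokens plus a tag
    docs = [
        TaggedDocument(words=["machine", "learning", "is", "fun"], tags=["doc0"]),
        TaggedDocument(words=["deep", "learning", "for", "text"], tags=["doc1"]),
    ]

    # dm=0 selects pure PV-DBOW: no per-word input vectors are trained
    model = Doc2Vec(documents=docs, dm=0, vector_size=100, window=5,
                    min_count=1, epochs=40)

    doc_vec = model.dv["doc0"]   # learned doc-vector (model.docvecs in gensim 3.x)
    new_vec = model.infer_vector(["text", "about", "machine", "learning"])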

The other mode, PV-DM (dm=1 in gensim, the default), does make use of neighboring word-vectors, in combination with doc-vectors, in a manner analogous to word2vec's CBOW mode – but any word-vectors it needs will be trained up simultaneously with the doc-vectors. They are not trained first in a separate step, so there's no easy splice-in point where you could provide word-vectors from elsewhere.
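
The same toy corpus can illustrate PV-DM: word-vectors and doc-vectors come out of one shared training run, with nothing imported beforehand (again only a sketch, not a definitive recipe):

    # dm=1 (the default) is PV-DM: context word-vectors and the doc-vector
    # are combined, CBOW-style, and trained together from scratch
    model_dm = Doc2Vec(documents=docs, dm=1, vector_size=100, window=5,
                       min_count=1, epochs=40)

    word_vec = model_dm.wv["learning"]   # word-vector trained alongside...
    doc_vec = model_dm.dv["doc1"]        # ...the doc-vector, in the same pass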

(You can mix skip-gram word-training into the PV-DBOW mode, with dbow_words=1 in gensim, but that will train word-vectors from scratch in an interleaved, shared-model process.)
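
That interleaved option is just one extra flag on the PV-DBOW configuration sketched above, e.g.:

    # PV-DBOW plus interleaved skip-gram word-training; the word-vectors are
    # still learned from scratch inside the shared model, not imported
    model_dbow_w = Doc2Vec(documents=docs, dm=0, dbow_words=1,
                           vector_size=100, window=5, min_count=1, epochs=40)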

To the extent you could pre-seed a model with word-vectors from elsewhere, it wouldn't necessarily improve results: it could easily send their quality sideways or worse. It might, in some lucky well-managed cases, speed model convergence, or be a way to enforce vector-space compatibility with an earlier vector-set, but not without extra gotchas and caveats that aren't part of the original algorithms or well-described practices.
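
If you do want to experiment with pre-seeding despite those caveats, one possible approach is to build the vocabulary, overwrite the randomly initialized word-vectors for words the vocabularies share, and then train as usual. This is only a sketch under assumptions: it reuses the toy docs corpus above, the file name pretrained_vectors.txt is hypothetical, and the attribute names (wv.key_to_index, wv.vectors) are gensim 4.x internals, so behavior may differ from the unmodified algorithm:

    from gensim.models import KeyedVectors
    from gensim.models.doc2vec import Doc2Vec

    # Hypothetical pretrained word2vec vectors in word2vec text format
    pretrained = KeyedVectors.load_word2vec_format("pretrained_vectors.txt",
                                                   binary=False)

    model_seeded = Doc2Vec(dm=1, vector_size=pretrained.vector_size,
                           window=5, min_count=1, epochs=40)
    model_seeded.build_vocab(docs)

    # Overwrite the random initial word-vectors for words present in both
    # vocabularies; training below will still adjust them further
    for word, idx in model_seeded.wv.key_to_index.items():
        if word in pretrained:
            model_seeded.wv.vectors[idx] = pretrained[word]

    model_seeded.train(docs, total_examples=model_seeded.corpus_count,
                       epochs=model_seeded.epochs)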
