
Is there a way to load pre-trained word vectors before training the doc2vec model?

I am trying to build a doc2vec model with roughly 10k sentences; afterwards I will use the model to find, for new sentences, the most similar sentence in the corpus.

I have trained a gensim doc2vec model using the corpus (10k sentences) I have. This model can, to some extent, tell me whether a new sentence is similar to some of the sentences in the corpus. But there is a problem: new sentences may contain words that don't exist in the corpus, which means those words have no word embedding. If this happens, the prediction result will not be good. As far as I know, the trained doc2vec model does have a matrix of doc vectors as well as a matrix of word vectors. So what I was thinking is to load a set of pre-trained word vectors, which contains a large number of words, and then train the model to get the doc vectors. Does this make sense? Is it possible with gensim? Or is there another way to do it?
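For reference, a minimal sketch of what I'm doing (gensim 4.x API; the corpus here is a hypothetical stand-in for my 10k tokenized sentences):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus: each sentence is already tokenized into a list of words
corpus = [["this", "is", "a", "sentence"], ["another", "example", "sentence"]]

# Tag each document with a unique id so its doc-vector can be looked up later
documents = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(corpus)]

model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=40)

# Infer a vector for a new, unseen sentence; words missing from the training
# vocabulary are silently ignored, which is exactly the problem described above
new_vector = model.infer_vector(["some", "new", "sentence"])

# Most similar training documents (model.dv in gensim 4.x; model.docvecs in 3.x)
print(model.dv.most_similar([new_vector], topn=5))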

Unlike what you might guess, typical Doc2Vec training does not train up word-vectors first and then compose doc-vectors from those word-vectors. Rather, in the modes that use word-vectors, the word-vectors are trained in a simultaneous, interleaved fashion alongside the doc-vectors, both changing together. And in one fast and well-performing mode, PV-DBOW (dm=0 in gensim), word-vectors aren't trained or used at all.
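For illustration, the mode is chosen at construction time; a sketch reusing the documents iterable from the question (the other parameter values are arbitrary):

from gensim.models.doc2vec import Doc2Vec

# PV-DM (dm=1, the default): word-vectors and doc-vectors are co-trained
dm_model = Doc2Vec(documents, dm=1, vector_size=100, epochs=40)

# PV-DBOW (dm=0): only doc-vectors are trained; word-vectors stay random and
# unused, unless dbow_words=1 adds interleaved skip-gram word-vector training
dbow_model = Doc2Vec(documents, dm=0, dbow_words=0, vector_size=100, epochs=40)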

So, gensim Doc2Vec doesn't support pre-loading state from elsewhere, and even if it did, it probably wouldn't provide the benefit you expect. (You could dig through the source code and perhaps force it by doing a bunch of initialization steps yourself. But then, if words were in the pre-loaded set but not in your training data, training the rest of the active words would adjust the entire model in directions incompatible with the imported-but-untrained 'foreign' words. It's only the interleaved, tug-of-war co-training of the model's state which makes the various vectors meaningful in relation to each other.)

The most straightforward and reliable strategy would be to expand your training corpus, by finding more documents from a similar/compatible domain, so that it includes multiple varied examples of any words you might encounter later. (If you thought some other word-vectors were apt enough for your domain, perhaps the texts that were used to train those word-vectors can be mixed into your training corpus. That's a reasonable way to put the word/document data from that other source on an equal footing in your model.)

And, as new documents arrive, you can also occasionally re-train the model from scratch, with the now-expanded corpus, letting newer documents contribute equally to the model's vocabulary and modeling strength.
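A minimal sketch of such a periodic full retrain (expanded_corpus is a hypothetical list of tokenized sentences combining the original corpus with the newly arrived documents):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# expanded_corpus: original sentences plus newly collected ones (hypothetical)
documents = [TaggedDocument(words=words, tags=[i])
             for i, words in enumerate(expanded_corpus)]

# Retrain from scratch so new words get vectors co-trained with everything else
model = Doc2Vec(documents, dm=0, vector_size=100, min_count=2, epochs=40)
model.save("doc2vec_retrained.model")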
