
gensim doc2vec train more documents from pre-trained model

I am trying to continue training a pre-trained model with new labelled documents (TaggedDocument).

The pre-trained model was trained on documents whose unique ids follow a label1_index scheme, for instance Good_0, Good_1, ... up to Good_999, and the total size of that training data is about 7000 documents.

Now, I want to train the pre-trained model with new documents whose unique ids follow a label2_index scheme, for instance Bad_0, Bad_1, ... up to Bad_1211; this new training set is about 1211 documents.

The training itself completed without any error, but the problem is that whenever I try to use 'most_similar' it only suggests similar documents labelled Good_..., where I expect documents labelled Bad_.

If I train on everything from the beginning, it gives me the answers I expected - it infers that a newly given document is similar to documents labelled either Good or Bad.

However, the continued-training approach above does not behave like the model trained on everything from the beginning.

Is continued training not working properly, or did I make a mistake?

The gensim Doc2Vec class can always be fed extra examples via train(), but it only discovers the working vocabulary of both word-tokens and document-tags during an initial build_vocab() step. So unless words/tags were available during the build_vocab(), they'll be ignored as unknown later. (The words get silently dropped from the text; the tags aren't trained or remembered inside the model.)
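For concreteness, here is a minimal sketch of the usual one-time training sequence, where the vocabulary and tag set are frozen at the build_vocab() call; the corpus contents and model parameters are illustrative assumptions, not taken from the question:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Illustrative corpus using the question's Good_N tag scheme
    good_docs = [
        TaggedDocument(words=["an", "example", "document"], tags=["Good_0"]),
        # ... more TaggedDocuments, Good_1 through Good_999
    ]

    model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
    model.build_vocab(good_docs)  # vocabulary & known tags are fixed here
    model.train(good_docs, total_examples=model.corpus_count, epochs=model.epochs)
    # Later train() calls silently drop any words/tags unseen by build_vocab()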

The Word2Vec superclass, from which Doc2Vec borrows a lot of functionality, has a newer, more-experimental parameter on its build_vocab() called update. If set true, that call to build_vocab() will add to, rather than replace, any prior vocabulary. However, as of February 2018, this option doesn't yet work with Doc2Vec, and indeed often causes memory-fault crashes.
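If you do experiment with that path anyway, the call pattern would look roughly like the sketch below; given the crash caveat above, treat it as illustrative only, and note the new corpus here is an assumption modeled on the question's Bad_N tags:

    # Experimental vocabulary expansion; per the caveat above, this path
    # was unreliable for Doc2Vec as of early 2018 - illustrative only.
    bad_docs = [
        TaggedDocument(words=["another", "example", "document"], tags=["Bad_0"]),
        # ... more TaggedDocuments, Bad_1 through Bad_1211
    ]

    model.build_vocab(bad_docs, update=True)  # add to, rather than replace, the vocab
    model.train(bad_docs, total_examples=len(bad_docs), epochs=model.epochs)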

But even if/when that can be made to work, providing incremental training examples isn't necessarily a good idea. By only updating parts of the model – those exercised by the new examples – the overall model can get worse, or its vectors made less self-consistent with each other. (The essence of these dense-embedding models is that the optimization over all varied examples results in generally-useful vectors. Training over just some subset causes the model to drift towards being good on just that subset, at likely cost to earlier examples.)

If you need new examples to also become part of the results for most_similar(), you might want to create your own separate set-of-vectors, outside of Doc2Vec. When you infer new vectors for new texts, you could add those to that outside set, and then implement your own most_similar() (using the gensim code as a model) to search over this expanding set of vectors, rather than just the fixed set that is created by initial bulk Doc2Vec training.
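A minimal sketch of that side-store approach, assuming an already-trained model as above; the store layout and helper names (outside_vectors, add_doc, my_most_similar) are illustrative inventions, not gensim API:

    import numpy as np

    outside_vectors = {}  # tag -> unit-normalized document vector

    def add_doc(model, tag, words):
        # Infer a vector for a new text and remember it outside the model
        vec = model.infer_vector(words)
        outside_vectors[tag] = vec / np.linalg.norm(vec)

    def my_most_similar(model, words, topn=10):
        # Cosine similarity of an inferred probe vector against the side store
        probe = model.infer_vector(words)
        probe = probe / np.linalg.norm(probe)
        sims = [(tag, float(np.dot(probe, vec))) for tag, vec in outside_vectors.items()]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        return sims[:topn]

    add_doc(model, "Bad_0", ["another", "example", "document"])
    print(my_most_similar(model, ["a", "new", "query", "document"]))

The built-in lookup over the model's own trained tags remains available as model.dv.most_similar() in current gensim (model.docvecs.most_similar() in the 3.x releases contemporary with this answer); the sketch above only searches vectors you add to the side store.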
