How can I initialize a gensim model with another model's vocabulary?

Question

I'm training some embeddings on a large corpus. I gather from gensim's documentation that it builds the vocabulary before beginning training. In my case, building the vocabulary takes many hours. I'd like to save time by re-using the vocabulary from the first model. How can I do this? The .build_vocab method can't take the vocabulary object from another model.

Here's a dummy example:

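from gensim.models import Word2Vec

# Word2Vec expects each sentence as a list of tokens, not a raw string
sentences = [s.split() for s in ["where are my goats", "yay i found my goats"]]
# gensim 3.x API (`size`); min_count=1 keeps the words of this tiny corpus
m1 = Word2Vec(sentences, size=3, min_count=1)
m2 = Word2Vec(size=4, min_count=1)
m2.build_vocab(m1.vocabulary)  # doesn't work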

Answer 1

Judging by this error ("it is not possible to re-use a vocabulary built with scan_vocab"), I believe this is not possible at this time.

Answer 2

build_vocab() is meant to survey a corpus of texts and configure the model's vocabulary from that corpus, so it doesn't take another model's internal state.

But you could either:

save the model after vocabulary-discovery, for later reuse from that point on; or
just directly modify any model however you'd like, to mimic another's state

For example, consider an initial session:

vocab_model = Word2Vec(size=3)
vocab_model.build_vocab(sentences)
vocab_model.save('vocab_initialized_but_untrained_model.w2v')

Now, you could immediately continue to train that model...

vocab_model.train(sentences, total_examples=vocab_model.corpus_count, epochs=10)

...and then perhaps do other work with that trained model, and .save() it, too.

But then also, later, you could simply re-load the vocabulary-initialized model and do other tinkering/training:

prior_model = Word2Vec.load('vocab_initialized_but_untrained_model.w2v')
# more operations on that model

And further, you can always directly modify the parts of a model however you'd like, though in some cases this may break the existing code's expectations about the model's state. For example:

source_model = Word2Vec.load('original_model.w2v')
new_model = Word2Vec(size=4)
new_model.vocabulary = source_model.vocabulary

(You probably need to copy over some other fields as well, to mimic all the effects of the first model's initialization, and perhaps re-trigger the final steps of build_vocab() with your new size/modes. See the source code, especially the methods whose names involve _weights or prepare_.)
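
A related idea, not from the answer above but in the same spirit of scanning the corpus only once, is to cache the raw word counts rather than a whole model. The following is a minimal standard-library sketch under that assumption; the helper names (count_words, save_counts, load_counts) and the cache file name are hypothetical, made up for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def count_words(sentences):
    """Scan tokenized sentences once and return a word -> count mapping."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts


def save_counts(counts, path):
    """Persist the counts as JSON so later sessions can skip the scan."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(counts, f)


def load_counts(path):
    """Reload a previously saved word -> count mapping."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Toy corpus from the question, tokenized into lists of words.
sentences = [s.split() for s in ["where are my goats", "yay i found my goats"]]

# Hypothetical cache location, chosen only for this illustration.
cache_path = os.path.join(tempfile.gettempdir(), "word_counts.json")

save_counts(count_words(sentences), cache_path)
word_freq = load_counts(cache_path)

# With gensim installed, a dict like this could seed a new model, e.g.
#   new_model = Word2Vec(size=4, min_count=1)
#   new_model.build_vocab_from_freq(word_freq)
# (not run here; constructor parameter names differ between gensim versions)
```

Whether this actually saves time depends on where the hours go: it only helps if the corpus scan itself is the expensive step, and the reloaded counts reflect the original corpus only.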
