How to initialize a gensim model with the vocabulary from another model?
I'm training some embeddings on a large corpus. I gather from gensim's documentation that it builds the vocabulary before beginning training. In my case, building the vocabulary takes many hours. I'd like to save time by re-using the vocabulary from the first model. How can I do this? The .build_vocab method can't take the vocabulary object from another model.
Here's a dummy example:
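from gensim.models import Word2Vec
# gensim 3.x API: `size` was renamed `vector_size` in gensim 4.x
# gensim expects each sentence as a list of tokens, not a raw string
sentences = [s.split() for s in ["where are my goats", "yay i found my goats"]]
m1 = Word2Vec(sentences, size=3, min_count=1)  # min_count=1 so this tiny corpus keeps its words
m2 = Word2Vec(size=4, min_count=1)
m2.build_vocab(m1.vocabulary)  # doesn't work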
Based on the resulting error (roughly, "re-using a vocabulary built with scan_vocab is not possible"), I believe this isn't possible at this time.
build_vocab() is designed to survey a corpus of texts and configure the model's vocabulary from that corpus, so it doesn't take another model's internal state.
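But you could either save a model right after its build_vocab() step and re-load it whenever you need it, or copy the vocabulary-related state from one model into another yourself.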
For example, consider an initial session:
vocab_model = Word2Vec(size=3)
vocab_model.build_vocab(sentences)
vocab_model.save('vocab_initialized_but_untrained_model.w2v')
Now, you could immediately continue to train that model...
vocab_model.train(sentences, total_examples=vocab_model.corpus_count, epochs=10)
...and then perhaps do other work with, and .save(), that trained model, too.
But then also, later, you could simply re-load the vocabulary-initialized model and do other tinkering/training:
prior_model = Word2Vec.load('vocab_initialized_but_untrained_model.w2v')
# more operations on that model
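To illustrate why saving right after the vocabulary step pays off, here is a gensim-free sketch of the same "scan once, persist, re-load" pattern using only the standard library. The `survey_corpus` helper, the `vocab.pkl` file name, and the toy corpus are assumptions made up for this illustration; gensim's own `.save()`/`.load()` store considerably more state than a word-count dictionary.

```python
import os
import pickle
import tempfile
from collections import Counter


def survey_corpus(sentences, min_count=1):
    """Count every token once, then keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy tokenized corpus; on a real corpus this scan is the hours-long part.
corpus = [s.split() for s in ["where are my goats", "yay i found my goats"]]

# Session 1: do the expensive scan once and persist the result.
vocab = survey_corpus(corpus)
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.pkl")
with open(vocab_path, "wb") as f:
    pickle.dump(vocab, f)

# Session 2 (possibly much later): re-load instead of rescanning the corpus.
with open(vocab_path, "rb") as f:
    reloaded_vocab = pickle.load(f)
```

The second session never touches the corpus, which is the saving the saved-but-untrained model gives you.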
And further, you can always directly modify the parts of a model however you'd like, though in some cases this may break the existing code's expectations about the model's state. For example:
source_model = Word2Vec.load('original_model.w2v')
new_model = Word2Vec(size=4)
new_model.vocabulary = source_model.vocabulary
(You probably need to copy over some other fields as well, to mimic all the effects of the first model's initialization, and perhaps re-trigger the final steps of build_vocab() with your new size/modes. See the source code, especially the methods with _weights or prepare_ in their names.)
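A rough sketch of why the final steps need re-triggering: the vocabulary does not depend on the vector size, but the weight arrays do. The `TinyEmbeddingModel` class below is a hypothetical stand-in written for this illustration, not gensim's API, and its `prepare_weights` method only loosely mirrors what a real model's weight-preparation step would do.

```python
import random


class TinyEmbeddingModel:
    """Hypothetical stand-in for an embedding model; not gensim's API."""

    def __init__(self, size):
        self.size = size      # vector dimensionality
        self.vocab = {}       # word -> row index; independent of size
        self.weights = []     # one vector per word; depends on size

    def build_vocab(self, sentences):
        # The slow part on a large corpus: one full pass over every token.
        for sentence in sentences:
            for token in sentence:
                self.vocab.setdefault(token, len(self.vocab))
        self.prepare_weights()

    def prepare_weights(self, seed=0):
        # The cheap final step: allocate one small random vector per word.
        rng = random.Random(seed)
        self.weights = [
            [(rng.random() - 0.5) / self.size for _ in range(self.size)]
            for _ in self.vocab
        ]


corpus = [s.split() for s in ["where are my goats", "yay i found my goats"]]

source_model = TinyEmbeddingModel(size=3)
source_model.build_vocab(corpus)

# Reuse the surveyed vocabulary, then redo only the size-dependent step.
new_model = TinyEmbeddingModel(size=4)
new_model.vocab = source_model.vocab
new_model.prepare_weights()
```

Note that both models now share one dictionary object; copy it first if either model might later change its vocabulary.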