How to initialize a gensim model with the vocabulary from another model?

I'm training some embeddings on a large corpus. I gather from gensim's documentation that it builds the vocabulary before beginning training. In my case, building the vocabulary takes many hours. I'd like to save time by reusing the vocabulary from the first model. How can I do this? The .build_vocab method can't take the vocabulary object from another model.

Here's a dummy example:

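from gensim.models import Word2Vec

# Word2Vec expects each sentence as a list of tokens, not a raw string
sentences = [s.split() for s in ["where are my goats", "yay i found my goats"]]
# min_count=1 keeps every word of this tiny corpus (the default of 5 would drop them all)
m1 = Word2Vec(sentences, size=3, min_count=1)
m2 = Word2Vec(size=4, min_count=1)
m2.build_vocab(m1.vocabulary)  # doesn't work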

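Based on this error (which says, roughly, that reusing a vocabulary built with scan_vocab is not possible), I believe this can't be done at this time.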

build_vocab() surveys a corpus of texts and configures the model's vocabulary from that corpus, so it doesn't accept another model's internal state.

But you could either:

  • save the model after vocabulary discovery, for later reuse from that point on; or
  • directly modify any model however you'd like, to mimic another's state

For example, consider an initial session:

vocab_model = Word2Vec(size=3)
vocab_model.build_vocab(sentences)
vocab_model.save('vocab_initialized_but_untrained_model.w2v')

Now, you could immediately continue to train that model...

vocab_model.train(sentences, total_examples=vocab_model.corpus_count, epochs=10)

...and then perhaps do other work with that trained model, and .save() it as well.

But later, you could also simply reload the vocabulary-initialized model and do other tinkering or training:

prior_model = Word2Vec.load('vocab_initialized_but_untrained_model.w2v')
# ... more operations on that model

Further, you can always directly modify the parts of a model however you'd like, though in some cases this may break the existing code's expectations about the model's state. For example:

source_model = Word2Vec.load('original_model.w2v')
new_model = Word2Vec(size=4)
new_model.vocabulary = source_model.vocabulary

(You probably need to copy over some other fields as well, to mimic all the effects of the first model's initialization, and perhaps re-trigger the final steps of build_vocab() with your new size/modes. See the source code, especially the methods whose names contain _weights or prepare_.)
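To illustrate why the vocabulary can be shared while the weights have to be rebuilt per model, here is a minimal library-free sketch. It is not gensim's implementation: survey_vocab and init_weights are hypothetical helpers invented for this example, and the weight initialization is only a rough stand-in. The point is that the corpus survey depends only on the texts, whereas the weight allocation depends on the vector size, which is why a copied vocabulary still needs a fresh weight-preparation step.

```python
import random
from collections import Counter

def survey_vocab(corpus, min_count=1):
    """Hypothetical helper: one pass over the corpus counting tokens (the slow step)."""
    counts = Counter(token for sentence in corpus for token in sentence)
    # Keep words meeting the frequency threshold, most frequent first.
    kept = [(w, c) for w, c in counts.most_common() if c >= min_count]
    return {w: {"index": i, "count": c} for i, (w, c) in enumerate(kept)}

def init_weights(vocab, size, seed=0):
    """Hypothetical helper: one small random vector per word (cheap, size-specific)."""
    rng = random.Random(seed)
    return [[(rng.random() - 0.5) / size for _ in range(size)] for _ in vocab]

corpus = [s.split() for s in ["where are my goats", "yay i found my goats"]]

vocab = survey_vocab(corpus)        # survey the corpus once
weights_3 = init_weights(vocab, 3)  # first "model": 3 dimensions
weights_4 = init_weights(vocab, 4)  # second "model": same vocab, 4 dimensions

print(len(vocab), len(weights_3[0]), len(weights_4[0]))
```

In a real gensim model the analogous state is spread over several attributes, so copying a single field is unlikely to be enough; the sketch only shows which part of the work is reusable.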

