
Gensim Word2vec model is not updating the previous word's embedding weights during incremental training

I want to train a previously trained word2vec model incrementally: update a word's weights if the word was seen in the earlier training, and create and train weights for new words that were not seen before. For example:

from gensim.models import Word2Vec
# old corpus
corpus = [["0", "1", "2", "3"], ["2", "3", "1"]]
# first train on old corpus
model = Word2Vec(sentences=corpus, size=2, min_count=0, window=2)
# checkout the embedding weights for word "1"
print(model["1"])

# here comes a new corpus with new word "4" and "5"
newCorpus = [["4", "1", "2", "3"], ["1", "5", "2"]]

# update the previous trained model
model.build_vocab(newCorpus, update=True)
model.train(newCorpus, total_examples=model.corpus_count, epochs=1)

# check if new word has embedding weights:
print(model["4"])  # yes

# check if previous word's embedding weights are updated
print(model["1"])  # output the same as before

It seems that the previous word's embedding is not updated even though that word's context has changed in the new corpus. Could someone tell me how to get the previous embedding weights updated?

Answer for original question

Try printing them out (or even just a few leading dimensions, e.g. print(model['1'][:5])) before & after to see if they've changed.

Or, at the beginning, make preEmbed a proper copy of the values (e.g. preEmbed = model['1'].copy()).

I think you'll see the values have really changed.

Your current preEmbed variable will only be a view into the array, changing along with the underlying array, so it will always return True for your later check.

Reviewing a writeup on NumPy Copies & Views will help explain what's happening, with further examples.
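
As a concrete illustration, here is a minimal NumPy-only sketch (with hypothetical array names) of the view-vs-copy behavior; the same mechanics apply to model['1'] and the model's underlying weight matrix:

import numpy as np

vectors = np.random.rand(3, 2)  # stands in for the model's weight matrix
view = vectors[1]               # a view: shares memory with `vectors`
snapshot = vectors[1].copy()    # a true copy: independent memory

vectors[1] += 0.5               # simulate training updating that row in place

print(np.array_equal(view, vectors[1]))      # True: the view changed along with the array
print(np.array_equal(snapshot, vectors[1]))  # False: the copy kept the old values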

Answer for updated code

It's likely that in your subsequent single-epoch training, all examples of '1' are being skipped via the sample downsampling feature, because '1' is a very-frequent word in your tiny corpus: 28.6% of all words. (In realistic natural-language corpora, the most-frequent word won't be more than a few percent of all words.)

I suspect if you disable this downsampling feature with sample=0, you'll see the changes you expect.
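
As a sketch under that assumption, the question's setup can be re-run with sample=0 so no occurrences of '1' are downsampled away (and with a true copy of the vector, per the first answer):

from gensim.models import Word2Vec

corpus = [["0", "1", "2", "3"], ["2", "3", "1"]]
model = Word2Vec(sentences=corpus, size=2, min_count=0, window=2, sample=0)

before = model["1"].copy()  # a proper copy, not a view

newCorpus = [["4", "1", "2", "3"], ["1", "5", "2"]]
model.build_vocab(newCorpus, update=True)
model.train(newCorpus, total_examples=len(newCorpus), epochs=1)

print((before != model["1"]).any())  # expect True: the weights have moved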

(Note that this feature is really helpful with adequate training data. More generally, lots of things about Word2Vec & related algorithms, and especially their core benefits, require lots of diverse data, and won't work well, or behave in expected ways, with toy-sized datasets.)

Also note: your second .train() should use an explicitly accurate count for newCorpus. (Using total_examples=model.corpus_count to re-use the cached corpus count may not always be appropriate when you're supplying extra data, even if it works OK here.)
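
For an in-memory list like newCorpus, an explicit count is just its length (a sketch; for a streamed corpus you'd need to count the examples yourself):

model.train(newCorpus, total_examples=len(newCorpus), epochs=1)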

Another thing to watch out for: once you start using a model for more-sophisticated operations like .most_similar(), it will have cached some calculated data for vector-to-vector comparisons, and this data won't always (at least through gensim-3.8.3) be refreshed with more training. So, you may have to discard that data (in gensim-3.8.3, via model.wv.vectors_norm = None) to be sure of getting fresh unit-normed vectors and fresh most_similar() (& related-method) results.
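
A sketch of that reset, assuming the gensim-3.8.3 behavior described above:

model.wv.vectors_norm = None       # discard stale cached unit-normed vectors
print(model.wv.most_similar("1"))  # recomputed from the freshly trained vectors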
