
Gensim Word2vec model is not updating the previous words' embedding weights during incremental training

I want to incrementally train a previously-trained word2vec model, that is, update the weights of words already seen in the earlier training, and create and train weights for new words that were not seen before. For example:

from gensim.models import Word2Vec
# old corpus
corpus = [["0", "1", "2", "3"], ["2", "3", "1"]]
# first train on old corpus
model = Word2Vec(sentences=corpus, size=2, min_count=0, window=2)
# checkout the embedding weights for word "1"
print(model["1"])

# here comes a new corpus with new word "4" and "5"
newCorpus = [["4", "1", "2", "3"], ["1", "5", "2"]]

# update the previous trained model
model.build_vocab(newCorpus, update=True)
model.train(newCorpus, total_examples=model.corpus_count, epochs=1)

# check if new word has embedding weights:
print(model["4"])  # yes

# check if previous word's embedding weights are updated
print(model["1"])  # output the same as before

It seems that the previous word's embedding is not updated, even though that word's context has changed in the new corpus. Could someone tell me how to get the previous embedding weights to update?

Answer for original question

Try printing them out (or even just a few leading dimensions, e.g. print(model['1'][:5])) before & after to see if they've changed.

Or, at the beginning, make preEmbed a proper copy of the values (e.g. preEmbed = model['1'].copy()).

I think you'll see the values have really changed.

Your current preEmbed variable is only a view into the underlying array, so it changes along with that array and will always return True for your later equality check.

Reviewing a writeup on NumPy Copies & Views will help explain what's happening, with further examples.
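
For example, a minimal before-and-after check might look like this (a sketch, assuming gensim 3.x, where the vectors are read through model.wv):

import numpy as np
from gensim.models import Word2Vec

corpus = [["0", "1", "2", "3"], ["2", "3", "1"]]
model = Word2Vec(sentences=corpus, size=2, min_count=0, window=2)

# .copy() freezes the current values; a bare model.wv["1"] is just a
# view into the weight array and can silently track later changes
preEmbed = model.wv["1"].copy()

newCorpus = [["4", "1", "2", "3"], ["1", "5", "2"]]
model.build_vocab(newCorpus, update=True)
model.train(newCorpus, total_examples=len(newCorpus), epochs=1)

print(preEmbed)
print(model.wv["1"])
print(np.allclose(preEmbed, model.wv["1"]))  # False if the vector actually moved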

Answer for updated code

It's likely that in your subsequent single-epoch training, all examples of '1' are being skipped via the sample downsampling feature, because '1' is a very-frequent word in your tiny corpus: 28.6% of all words. (In realistic natural-language corpora, the most-frequent word won't be more than a few percent of all words.)

I suspect that if you disable this downsampling feature with sample=0, you'll see the changes you expect.
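
A sketch of the same run with downsampling disabled (sample=0 is a standard Word2Vec constructor parameter; the rest follows the gensim 3.x code from the question):

import numpy as np
from gensim.models import Word2Vec

corpus = [["0", "1", "2", "3"], ["2", "3", "1"]]
newCorpus = [["4", "1", "2", "3"], ["1", "5", "2"]]

# sample=0 turns off frequent-word downsampling, so even a word that makes
# up a large share of this tiny corpus keeps all of its training examples
model = Word2Vec(sentences=corpus, size=2, min_count=0, window=2, sample=0)

before = model.wv["1"].copy()
model.build_vocab(newCorpus, update=True)
model.train(newCorpus, total_examples=len(newCorpus), epochs=1)

print(np.allclose(before, model.wv["1"]))  # expect False: the vector has been updated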

(Note that this feature is really helpful with adequate training data, and more generally, lots of things about Word2Vec & related algorithms, and especially their core benefits, require lots of diverse data – and won't work well, or behave in expected ways, with toy-sized datasets.)

Also note: your second .train() should use an explicitly accurate count for newCorpus. (Re-using the cached corpus count via total_examples=model.corpus_count may not always be appropriate when you're supplying different data, even if it happens to work here.)
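
Continuing from the model and newCorpus in the question, that would look like (a sketch):

# count the new corpus explicitly instead of re-using the corpus_count
# cached from the first training pass
model.train(newCorpus, total_examples=len(newCorpus), epochs=1)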

Another thing to watch out for: once you start using a model for more-sophisticated operations like .most_similar(), it will have cached some calculated data for vector-to-vector comparisons, and this data won't always (at least through gensim-3.8.3) be refreshed by more training. So, you may have to discard that data (in gensim-3.8.3, by model.wv.vectors_norm = None) to be sure of having fresh unit-normed vectors, and thus fresh most_similar() (& related method) results.
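
A sketch of that refresh, assuming gensim-3.8.3 (where the unit-normed vectors are cached on model.wv.vectors_norm):

# discard the cached unit-normed vectors so that most_similar() (and
# related methods) recompute them from the freshly-trained weights
model.wv.vectors_norm = None
print(model.wv.most_similar("1"))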
