Online updating Word2Vec

I'm having trouble updating my Word2Vec model online.

I have a document and build a model from it. The document can later be updated with new words, and then I need to update the vocabulary and the model as a whole.

I know that gensim 0.13.4.1 supports this.

My code:

import gensim

model = gensim.models.Word2Vec(size=100, window=10, min_count=5, workers=11, alpha=0.025, min_alpha=0.025, iter=20)
model.build_vocab(sentences, update=False)

model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)

model.save('model.bin')

After this, I get new words. For example:

sen2 = [['absd', 'jadoih', 'sdohf'], ['asdihf', 'oisdh', 'oiswhefo'], ['a', 'v', 'b', 'c'], ['q', 'q', 'q']]

model.build_vocab(sen2, update=True)
model.train(sen2, epochs=model.iter, total_examples=model.corpus_count)

What's wrong and how can I solve my problem?

Your model is set to ignore words with fewer than 5 occurrences: min_count=5. In fact, it requires at least 5 occurrences within a single build_vocab() call. (It won't remember that there were 3 occurrences before, see 2 new ones, and then train on all 5. It needs all 5 or more in one batch.)

If you're only calling your update with the tiny dataset shown, no new words will make the cut.
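To see why nothing in the sample update survives, here is a minimal pure-Python sketch of a per-batch frequency cutoff. The helper surviving_words is hypothetical and only imitates the behaviour described above; it is not gensim's actual code.

```python
from collections import Counter


def surviving_words(batch, min_count):
    """Return the words in `batch` that occur at least `min_count` times.

    Hypothetical helper: it imitates a per-call frequency cutoff and
    counts only the sentences passed in this one call.
    """
    counts = Counter(word for sentence in batch for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# The update batch from the question.
sen2 = [['absd', 'jadoih', 'sdohf'], ['asdihf', 'oisdh', 'oiswhefo'],
        ['a', 'v', 'b', 'c'], ['q', 'q', 'q']]

# With the question's threshold nothing survives: even 'q' occurs only 3 times.
print(surviving_words(sen2, min_count=5))

# With a lower threshold, 'q' is the only word frequent enough.
print(surviving_words(sen2, min_count=3))
```

Under this reading, the update batch would need either more data or a lower min_count before any new word could enter the vocabulary.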

More generally, if at all possible, you should retrain the whole model with all old and new data. That will ensure equal influence is given to old and new words, and any words are treated properly according to their combined frequency. Making small incremental updates to a Word2Vec model risks pulling newer words, or old words that continue to reappear, out of meaningful arrangement with older words that were only trained in the original (or earlier) batches. (Only words that go through the same interleaved training cycles are fully positionally adjusted with respect to each other.)
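The point about combined frequency can be illustrated with a small sketch. The corpora, the words in them, and the helper frequent_words are all made up for illustration; the counting only imitates a min_count cutoff and does not call gensim.

```python
from collections import Counter
from itertools import chain


def frequent_words(sentences, min_count):
    """Hypothetical helper: words occurring at least `min_count` times."""
    counts = Counter(chain.from_iterable(sentences))
    return {word for word, n in counts.items() if n >= min_count}


MIN_COUNT = 5

# Made-up data: 'rare' appears 3 times in the old corpus and 2 times in the new one.
old_sentences = [['rare', 'common']] * 3 + [['common']] * 2
new_sentences = [['rare', 'fresh']] * 2

# Counting each batch on its own: 'rare' never reaches the threshold.
separately = (frequent_words(old_sentences, MIN_COUNT)
              | frequent_words(new_sentences, MIN_COUNT))

# Counting old and new data together: 'rare' now has 5 occurrences.
combined = frequent_words(old_sentences + new_sentences, MIN_COUNT)

print(sorted(separately))  # ['common']
print(sorted(combined))    # ['common', 'rare']
```

In a full retrain, the concatenated corpus (old_sentences + new_sentences here) is what would be handed to a fresh model's build_vocab() and train() calls, so every word is counted and trained over all of the data.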
