Gensim's Word2Vec not training provided documents

I'm facing a training problem with Gensim's Word2Vec. model.wv.vocab does not pick up any new words from the corpus passed to train(); the only words it contains are those from the corpus supplied when the model was initialized.

After many attempts with my own code, I found that even the example from the official site doesn't work.

I tried saving the model at several points in my code, and I even tried saving and reloading the corpus alongside the train() call.

from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

path = get_tmpfile("word2vec.model")

model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")

print(len(model.wv.vocab))

model.train([["hello", "world"]], total_examples=1, epochs=1)
model.save("word2vec.model")

print(len(model.wv.vocab))

The first print statement gives 12, which is right.

The second also gives 12, when it's supposed to give 14 (the original vocabulary plus 'hello' and 'world').

Additional calls to train() don't expand the known vocabulary, so the value of len(model.wv.vocab) cannot change after another call to train(). (Either 'hello' and 'world' were already known to the model, in which case they were in the original count of 12, or they weren't known, in which case they were ignored.)

The vocabulary is only established during a specific build_vocab() phase, which happens automatically if, as your code shows, you supply a training corpus (common_texts) at model instantiation.

You can use a call to build_vocab() with the optional added parameter update=True to incrementally update a model's vocabulary, but this is best considered an advanced/experimental technique that introduces added complexities. (Whether such vocab-expansion, and then followup incremental training, actually helps or hurts will depend on getting a lot of other murky choices about alpha, epochs, corpus-sizing, training modes, and corpus-contents correct.)
