Gensim Word2Vec Model trained but not saved

I am using gensim and executed the following code (simplified):

model = gensim.models.Word2Vec(...)
model.build_vocab(sentences)
model.train(...)
model.save('file_name')

After days, my code finished model.train(...). However, during saving, I got:

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I noticed that there were some npy files generated:

<...>.trainables.syn1neg.npy
<...>.trainables.vectors_lockf.npy
<...>.wv.vectors.npy

Are those intermediate results I can re-use or do I have to rerun the entire process?

Those are parts of the saved model, but unless the master file_name file (a Python-pickled object) exists and is complete, they may be hard to re-use.

However, if your primary interest is the final word-vectors, those are in the .wv.vectors.npy file. If it appears to be full-length (the same size as the syn1neg file), it may be complete. What you're missing is the dict that tells you which word is in which index.
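As a quick check, something like the following minimal sketch could compare the two arrays (assuming the file prefix 'file_name' from your save() call, and that the arrays were written in numpy's plain .npy format, as gensim's spilled-out array files are):

import numpy as np

vectors = np.load('file_name.wv.vectors.npy', mmap_mode='r')
syn1neg = np.load('file_name.trainables.syn1neg.npy', mmap_mode='r')
print(vectors.shape, syn1neg.shape)  # matching shapes suggest the vectors array is full-length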

So, the following might work (a rough code sketch follows the numbered list):

  1. Repeat the original process, with the exact same corpus & model parameters, but only through the build_vocab() step. At that point, the new model.wv.vocab dict should be identical to the one from the failed-save run.

  2. Save that model, without ever calling train() on it, to a new filename.

  3. After confirming that newmodel.wv.vectors.npy (with randomly-initialized, untrained vectors) is the same size as oldmodel.wv.vectors.npy, copy the old model's vectors file over to the new model's name.

  4. Re-load the new model, and run some sanity checks that the words make sense.

  5. Perhaps, save off just the word-vectors, using something like newmodel.wv.save() or newmodel.wv.save_word2vec_format().
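Putting those steps together, a rough sketch might look like the following. This assumes gensim 3.x (the release that writes the .trainables.* file names shown above); the corpus, parameters, and file names are placeholders you'd replace with whatever the original run actually used:

import shutil
import numpy as np
import gensim

sentences = ...  # placeholder: the exact same corpus iterable used in the original run

# 1. Rebuild the vocabulary with the same parameters as the failed run.
newmodel = gensim.models.Word2Vec(size=300, window=5, min_count=5, workers=4)
newmodel.build_vocab(sentences)

# 2. Save the untrained model; for large models, the big arrays go into sibling .npy files.
newmodel.save('newmodel')

# 3. If the old & new vectors arrays have the same shape, overwrite the new
#    (randomly-initialized) array file with the old (trained) one.
old = np.load('oldmodel.wv.vectors.npy', mmap_mode='r')
new = np.load('newmodel.wv.vectors.npy', mmap_mode='r')
if old.shape == new.shape:
    shutil.copyfile('oldmodel.wv.vectors.npy', 'newmodel.wv.vectors.npy')

# 4. Re-load and sanity-check that familiar words have sensible neighbors.
patched = gensim.models.Word2Vec.load('newmodel')
print(patched.wv.most_similar('word'))  # replace 'word' with a frequent word from your corpus

# 5. Optionally save off just the word-vectors.
patched.wv.save('newmodel.kv')
# or: patched.wv.save_word2vec_format('newmodel_vectors.txt', binary=False)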

Potentially, the resurrected newmodel could also be patched to use the old syn1neg file, if it appears complete. It might then work to further train the patched model (either with or without having reused the older syn1neg).
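For example, under the same gensim 3.x assumptions and illustrative file names as the sketch above, the patching could look roughly like this:

import numpy as np
import gensim

patched = gensim.models.Word2Vec.load('newmodel')
old_syn1neg = np.load('oldmodel.trainables.syn1neg.npy')
if old_syn1neg.shape == patched.trainables.syn1neg.shape:
    # swap in the old hidden-layer weights alongside the old vectors
    patched.trainables.syn1neg = old_syn1neg
    patched.save('patchedmodel')
    # further training would then continue from the old weights, e.g.:
    # patched.train(sentences, total_examples=patched.corpus_count, epochs=patched.epochs)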

Separately: only the very largest corpora, an installation missing the gensim cython optimizations, or a machine without enough RAM (and thus swapping during training) would usually require a training session lasting days. You might be able to run much faster. Check the following (an illustrative configuration follows the list):

  • Is any virtual-memory swapping happening during training? If so, it will be disastrous for training throughput, and you should either use a machine with more RAM or trim the vocabulary/model size more aggressively with a higher min_count. (Smaller min_count values mean a larger model, slower training, poor-quality vectors for words with just a few examples, and, counterintuitively, worse-quality vectors for more-frequent words as well, because of interference from the noisy rare words. It's usually better to ignore the lowest-frequency words.)

  • Is there any warning displayed about a "slow version" (pure Python with no effective multi-threading) being used? If so, your training will be ~100X slower than if that problem is resolved. If the optimized code is available, maximum training throughput will likely be achieved with a workers value somewhere between 3 and 12 (but never larger than the number of machine CPU cores).

  • For a very large corpus, the sample parameter can be made more aggressive – such as 1e-04 or 1e-05 instead of the default 1e-03 – and it may both speed training and improve vector quality, by avoiding lots of redundant overtraining of the most-frequent words.
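For illustration only, a setup touching those knobs might look like the sketch below (gensim 3.x argument names; the particular values are examples to adapt to your data, not recommendations):

import multiprocessing
import gensim

sentences = ...  # placeholder: your iterable of tokenized sentences

model = gensim.models.Word2Vec(
    size=300,                                      # vector dimensionality
    min_count=20,                                  # higher value: smaller model, noisy rare words dropped
    sample=1e-5,                                   # more-aggressive downsampling of very frequent words
    workers=min(12, multiprocessing.cpu_count()),  # only helps if the cython-optimized code is in use
)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)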

Good luck!
