
Gensim Word2Vec model: Cut dimensions

I have a trained word2vec model in gensim with 300 dimensions and would like to cut it down to 100 dimensions (simply drop the last 200). What is the easiest and most efficient way to do this in Python?

You could save the model in the word2vec text format. Make sure to save it as a text file (.txt). The word2vec text format is as follows:

The first line is <vocabulary_size> <embedding_size>; in your case <embedding_size> will be 300. Each remaining line is <word> followed by 300 space-separated floating-point numbers. You can easily parse this file in Python and discard the last 200 floats from each line. Make sure to also update <embedding_size> on the first line. Save the result as a new file, and you can then load it as a fresh word2vec model using load_word2vec_format().

You should be able to trim the dimensions inside a KeyedVectors instance, then save it – so you don't have to do anything special with the format on disk. For example:

kv = w2v_model.wv
kv.vectors = kv.vectors[:, :100]  # keep just the first 100 dimensions
kv.vector_size = 100

Now kv can be saved (via either gensim's native .save() or the interchange format .save_word2vec_format()), or simply used as-is as a subset of the original dimensions.

(While any 100 dimensions of a larger embedding are likely to be about as good as any other 100, you'll lose some of the 300 dimensions' expressiveness, in arbitrary ways. Re-training with 100 dimensions from the start might do better, as might a dimensionality-reduction algorithm such as PCA, which would in effect keep the "most expressive" 100 dimensions.)
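As a sketch of that dimensionality-reduction alternative, here is a plain-NumPy PCA via SVD (the function name is hypothetical); you could apply it to kv.vectors before updating kv.vector_size:

```python
import numpy as np

def pca_reduce(vectors, n_components):
    """Project rows of `vectors` onto their top n_components principal axes."""
    centered = vectors - vectors.mean(axis=0)
    # SVD of the centered matrix; rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# e.g. kv.vectors = pca_reduce(kv.vectors, 100); kv.vector_size = 100
```

Unlike simply slicing off dimensions, this keeps the directions along which the vectors vary most.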

