
Gensim Word2Vec model: Cut dimensions

I have a trained word2vec model in gensim with 300 dimensions and would like to cut it down to 100 dimensions (simply drop the last 200). What is the easiest and most efficient way to do this in Python?

You could save the model in the word2vec text format. Make sure to save it as a text file (.txt). The word2vec text format is as follows:

The first line is <vocabulary_size> <embedding_size>; in your case <embedding_size> will be 300. Each remaining line is <word> followed by 300 space-separated floating-point numbers. You can easily parse this file in Python and discard the last 200 floats from each line. Make sure to also update <embedding_size> on the first line. Save the result as a new file, and you can then load it as a fresh word2vec model using load_word2vec_format().

You should be able to trim the dimensions inside a KeyedVectors instance, then save it – so you don't have to do anything special with the format on disk. For example:

kv = w2v_model.wv
kv.vectors = kv.vectors[:, :100]  # keep just the first 100 dimensions
kv.vector_size = 100

Now kv can be saved (via either gensim's native .save() or the interchange format .save_word2vec_format()), or simply used as-is as a subset of the original dimensions.

(While any 100 dimensions of a larger embedding are likely to be about as good as any other 100, you'll lose some of the 300 dimensions' expressiveness, in arbitrary ways. Re-training with 100 dimensions from the start might do better, as might a dimensionality-reduction algorithm such as PCA, which would in effect keep the "most expressive" 100 dimensions.)
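As a sketch of that dimensionality-reduction alternative, here is a plain-NumPy PCA via SVD (the function name is hypothetical); you could apply it to kv.vectors before updating kv.vector_size:

```python
import numpy as np

def pca_reduce(vectors, n_components):
    """Project rows of `vectors` onto their top n_components principal axes."""
    centered = vectors - vectors.mean(axis=0)
    # SVD of the centered matrix; rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# e.g. kv.vectors = pca_reduce(kv.vectors, 100); kv.vector_size = 100
```

Unlike simply slicing off dimensions, this keeps the directions along which the vectors vary most.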

