
Normalize vectors in gensim model

I have a pre-trained word embedding with vectors of different norms, and I want to normalize all vectors in the model. I am doing it with a for loop that iterates over each word and normalizes its vector, but the model is huge and this takes too much time. Does gensim include any way to do this faster? I cannot find one.


Gensim's KeyedVectors instances (the common interface for sets of word vectors) have an init_sims() method, which internally calculates unit-length-normalized vectors using a native vector operation for speed.
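To illustrate what a "native vector operation" buys you compared with a per-word loop, here is a minimal NumPy sketch. It is not gensim's actual implementation; the `vectors` array and both helper functions are hypothetical stand-ins for a model's vector table, and the sketch assumes no row is all zeros.

```python
import numpy as np

def normalize_rows_loop(vectors):
    """Normalize one row at a time, like looping over every word (slow for big vocabularies)."""
    out = np.empty_like(vectors)
    for i, row in enumerate(vectors):
        out[i] = row / np.linalg.norm(row)
    return out

def normalize_rows_vectorized(vectors):
    """Normalize all rows with a single bulk operation over the whole matrix."""
    # keepdims=True gives shape (n, 1), so the division broadcasts across each row.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

# Hypothetical stand-in for a (vocab_size, dim) table of word vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 50)).astype(np.float32)

normed = normalize_rows_vectorized(vectors)
```

Both functions should give the same result; the vectorized one simply pushes the loop down into compiled NumPy code, which is where the speedup comes from.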

When certain operations that are usually performed on unit-normalized vectors are attempted for the first time, init_sims() is called automatically, and the model caches the normalized vectors in a model property (vectors_norm), roughly doubling RAM consumption.

Once it's been called, you can access normed vectors using the .word_vec() method:

normed_wv = kv_model.word_vec(word, use_norm=True)

If you're sure you won't need the raw, un-normed vectors, you can also call init_sims() yourself with its optional replace parameter. The normed vectors then overwrite the raw vectors in place, saving the extra RAM. For example:

kv_model.init_sims(replace=True)
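The memory trade-off between caching a normalized copy and overwriting in place can be sketched in plain NumPy. This only illustrates the idea and is not gensim's code; `raw`, `cached` and `inplace` are hypothetical names.

```python
import numpy as np

# Hypothetical stand-in for a model's raw (vocab_size, dim) vector table.
rng = np.random.default_rng(1)
raw = rng.normal(size=(1000, 50)).astype(np.float32)

# Cached-copy approach: the raw array is kept and a second array of the
# same size is allocated for the normalized vectors.
cached = raw / np.linalg.norm(raw, axis=1, keepdims=True)
total_bytes_with_cache = raw.nbytes + cached.nbytes

# In-place approach: the array is overwritten, so no second full-size
# array has to be kept around afterwards.
inplace = raw.copy()  # copied only so this demo can still compare against `raw`
inplace /= np.linalg.norm(inplace, axis=1, keepdims=True)
total_bytes_in_place = inplace.nbytes
```

In a real model you would skip the `copy()`, which is exactly why the raw vectors are gone after an in-place normalization.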


Note that while things like finding the nearest-neighbors of a word, as in the common most_similar() operation, traditionally use unit-normalized vectors, there are sometimes downstream applications where the raw vectors are useful. (Also, in a full Word2Vec model, if you're going to do additional incremental training, that should happen on raw vectors, not normalized vectors.)
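A small NumPy sketch of why nearest-neighbor lookups work on unit vectors, and what normalization throws away. The vectors and the `cosine` helper are made up for illustration and are not part of gensim.

```python
import numpy as np

# Hypothetical raw vectors: rows 0 and 1 point the same way but differ in length.
raw = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

def cosine(a, b):
    """Cosine similarity between two vectors of arbitrary length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# On unit vectors, cosine similarity reduces to a plain dot product,
# so similarity search can use one matrix-vector product.
sim_raw = cosine(raw[0], raw[1])
sim_unit = float(unit[0] @ unit[1])

# After normalization rows 0 and 1 are indistinguishable: the magnitude
# information in the raw vectors cannot be recovered from `unit`.
same_after_norm = bool(np.allclose(unit[0], unit[1]))
```

So similarity rankings are unaffected by normalizing, but anything downstream that depends on vector length needs the raw vectors.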

