
How to add new word vectors to gensim.models.keyedvectors and calculate most_similar

According to gensim's documentation for WordEmbeddingsKeyedVectors, you can incrementally add new key–vector pairs. However, after initializing a WordEmbeddingsKeyedVectors with pre-trained vectors and their tags, then adding new, unseen model-inferred word vectors to it, the most_similar method no longer works.

import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

test = WordEmbeddingsKeyedVectors(vector_size=3)

test.add(entities=["1", "2"], weights=[np.random.randint(5, size=3),
                                       np.random.randint(5, size=3)])

test.most_similar("2")  # THIS WORKS

test.add(entities=["3"], weights=[np.random.randint(5, size=3)])

test.most_similar("3")  # THIS FAILS

I expect the output to be a list of vector tags most similar to the input tag, but instead I get:

IndexError: index 2 is out of bounds for axis 0 with size 2

It appears the add() operation isn't clearing the cache of unit-normalized vectors that's created and re-used by most_similar()-like operations.

Just before or after performing an add(), you can explicitly delete that cache with:

del test.vectors_norm

Then, your test.most_similar('3') should work without the IndexError.
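For completeness, the stale-cache failure mode can be reproduced without gensim at all. The class below (NormCache is a hypothetical name, just mimicking the lazy vectors_norm cache described above) shows why the second most_similar call raises an IndexError and why dropping the cache fixes it:

```python
import numpy as np

class NormCache:
    """Hypothetical mimic of KeyedVectors' lazy unit-norm cache."""
    def __init__(self, vector_size):
        self.vectors = np.empty((0, vector_size))
        self.vectors_norm = None  # lazily built, like gensim's init_sims()

    def add(self, weights):
        # new raw vectors are appended, but the norm cache is NOT refreshed
        self.vectors = np.vstack([self.vectors, weights])

    def init_sims(self):
        # only computes the cache when it doesn't exist yet
        if self.vectors_norm is None:
            norms = np.linalg.norm(self.vectors, axis=1, keepdims=True)
            self.vectors_norm = self.vectors / norms

    def similarity_row(self, index):
        self.init_sims()
        return self.vectors_norm[index]  # IndexError if the cache is stale

cache = NormCache(3)
cache.add(np.random.rand(2, 3))
cache.similarity_row(1)           # fine: cache built with 2 rows

cache.add(np.random.rand(1, 3))   # vectors grows to 3 rows, cache stays at 2
try:
    cache.similarity_row(2)       # same failure as test.most_similar("3")
except IndexError as e:
    print(e)                      # index 2 is out of bounds for axis 0 with size 2

cache.vectors_norm = None         # equivalent of `del test.vectors_norm`
cache.similarity_row(2)           # cache is rebuilt with all 3 rows; works
```

Invalidating the cache forces init_sims() to recompute norms over the full, up-to-date vectors array on the next similarity query.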

(I've added a bug-report for this problem to the gensim project.)

In fact, I've figured out a solution to this.

In gensim/models/keyedvectors.py, under class WordEmbeddingsKeyedVectors, we can change

def init_sims(self, replace=False):
    """Precompute L2-normalized vectors."""
    if getattr(self, 'vectors_norm', None) is None or replace:
        logger.info("precomputing L2-norms of word weight vectors")
        self.vectors_norm = _l2_norm(self.vectors, replace=replace)

to

def init_sims(self, replace=False):
    """Precompute L2-normalized vectors."""
    if getattr(self, 'vectors_norm', None) is None or replace:
        logger.info("precomputing L2-norms of word weight vectors")
        self.vectors_norm = _l2_norm(self.vectors, replace=replace)
    elif len(self.vectors_norm) < len(self.vectors):
        # vectors were added after the cache was built:
        # normalize only the new rows and append them to the cache
        logger.info("adding L2-norm vectors for new documents")
        diff = len(self.vectors) - len(self.vectors_norm)
        self.vectors_norm = vstack((self.vectors_norm, _l2_norm(self.vectors[-diff:])))

Essentially, the original function computes self.vectors_norm by L2-normalizing self.vectors only when the cache doesn't exist yet. However, if vectors have been added to self.vectors after the cache was built, we should normalize just the new vectors and append them to self.vectors_norm.
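The incremental update above can be checked with plain numpy (the l2_norm helper here is a stand-in for gensim's _l2_norm, assuming it simply divides each row by its Euclidean length):

```python
import numpy as np

def l2_norm(m):
    # row-wise unit normalization, a stand-in for gensim's _l2_norm helper
    return m / np.linalg.norm(m, axis=1, keepdims=True)

vectors = np.random.rand(5, 3)
vectors_norm = l2_norm(vectors)       # cache built for the first 5 vectors

# two new vectors arrive after the cache was built
vectors = np.vstack([vectors, np.random.rand(2, 3)])

# incremental update: normalize only the new rows and append them
diff = len(vectors) - len(vectors_norm)
vectors_norm = np.vstack([vectors_norm, l2_norm(vectors[-diff:])])

# the incremental cache matches a full recompute
assert np.allclose(vectors_norm, l2_norm(vectors))
```

Because normalization is row-independent, appending norms of only the new rows is exactly equivalent to recomputing the whole cache, just cheaper.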

I'll post this as a comment to your bug-report @gojomo and add a pull request! Thanks :)
