According to Gensim's documentation for WordEmbeddingsKeyedVectors, you can incrementally add new key-vector pairs. However, after initializing a WordEmbeddingsKeyedVectors instance with pre-trained vectors and their tags, and then adding new, unseen model-inferred word vectors to it, the most_similar method can no longer be used.
import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

test = WordEmbeddingsKeyedVectors(vector_size=3)
test.add(entities=["1", "2"], weights=[np.random.randint(5, size=3),
                                       np.random.randint(5, size=3)])
test.most_similar("2")  # THIS WORKS

test.add(entities=["3"], weights=[np.random.randint(5, size=3)])
test.most_similar("3")  # THIS FAILS
I expect the output to be a list of vector tags most similar to the input tag, but the output is:
IndexError: index 2 is out of bounds for axis 0 with size 2
It appears the add() operation isn't clearing the cache of normalized-to-unit-length vectors that's created and re-used by most_similar()-like operations.
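The mismatch can be shown with plain NumPy (hypothetical names; `norms` stands in for gensim's `vectors_norm` cache): after an append, the vector matrix has three rows but the cached unit-length matrix still has two, so looking up the new row's index in the cache is out of bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two initial vectors (non-zero so normalization is safe).
vectors = rng.integers(1, 5, size=(2, 3)).astype(float)

# A most_similar()-style operation builds and caches unit-length rows once:
norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# An add()-style operation grows `vectors` but leaves the cache untouched:
vectors = np.vstack([vectors, rng.integers(1, 5, size=(1, 3)).astype(float)])

print(len(vectors))  # 3
print(len(norms))    # 2 -- indexing row 2 of `norms` raises IndexError
```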
Just before or after performing an add(), you can explicitly delete that cache with:

del test.vectors_norm

Then your test.most_similar('3') should work without the IndexError.
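The logic behind that workaround can be sketched in plain NumPy (a hypothetical toy class, not gensim's actual implementation): drop the normalized cache whenever vectors are added, so the next similarity query lazily rebuilds it at the correct size.

```python
import numpy as np

class TinyKeyedVectors:
    """Toy stand-in for a keyed-vectors store with a lazily built norm cache."""

    def __init__(self, vector_size=3):
        self.vectors = np.empty((0, vector_size))
        self._norms = None  # plays the role of gensim's vectors_norm

    def add(self, rows):
        self.vectors = np.vstack([self.vectors, rows])
        self._norms = None  # the workaround: drop the stale cache on every add()

    def most_similar(self, index):
        if self._norms is None:  # lazily recompute, like init_sims()
            self._norms = self.vectors / np.linalg.norm(
                self.vectors, axis=1, keepdims=True)
        sims = self._norms @ self._norms[index]  # cosine similarity to `index`
        order = np.argsort(-sims)                # most similar first
        return [int(i) for i in order if i != index]

kv = TinyKeyedVectors()
kv.add([[1., 0., 0.], [0., 1., 0.]])
kv.add([[2., 0., 0.]])
print(kv.most_similar(2))  # [0, 1] -- row 0 points the same way, so it ranks first
```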
(I've added a bug-report for this problem to the gensim project.)
In fact, I've figured out a solution to this.
In the gensim.models.keyedvectors module, under class WordEmbeddingsKeyedVectors, we can change
def init_sims(self, replace=False):
    """Precompute L2-normalized vectors."""
    if getattr(self, 'vectors_norm', None) is None or replace:
        logger.info("precomputing L2-norms of word weight vectors")
        self.vectors_norm = _l2_norm(self.vectors, replace=replace)
to
def init_sims(self, replace=False):
    """Precompute L2-normalized vectors."""
    if getattr(self, 'vectors_norm', None) is None or replace:
        logger.info("precomputing L2-norms of word weight vectors")
        self.vectors_norm = _l2_norm(self.vectors, replace=replace)
    elif len(self.vectors_norm) == len(self.vectors):
        # all vectors have already been pre-computed into L2-normalized vectors
        pass
    else:
        # vectors were added after the L2-normalized cache was built
        logger.info("adding L2-norm vectors for new documents")
        diff = len(self.vectors) - len(self.vectors_norm)
        self.vectors_norm = vstack((self.vectors_norm, _l2_norm(self.vectors[-diff:])))
Essentially, what the original function does is: if self.vectors_norm does not exist yet, it is computed by L2-normalizing self.vectors. However, if self.vectors contains newly added vectors that have not yet been L2-normalized, we should normalize just those new vectors and append them to self.vectors_norm.
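The incremental idea in the patch can be checked in plain NumPy (hypothetical names; `l2_norm` mimics gensim's `_l2_norm` helper): normalizing only the rows appended since the cache was built, and stacking them onto the old cache, gives the same matrix as a full recompute.

```python
import numpy as np

def l2_norm(m):
    # Normalize each row to unit length.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

rng = np.random.default_rng(1)
vectors = rng.integers(1, 5, size=(4, 3)).astype(float)

# Suppose the cache was built when only the first 2 rows existed:
vectors_norm = l2_norm(vectors[:2])

# Incremental update, as in the patched init_sims():
diff = len(vectors) - len(vectors_norm)  # 2 rows added since the cache was built
vectors_norm = np.vstack((vectors_norm, l2_norm(vectors[-diff:])))

# The incrementally updated cache matches a full recompute:
print(np.allclose(vectors_norm, l2_norm(vectors)))  # True
```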
I'll post this as a comment to your bug-report @gojomo and add a pull request! Thanks :)