
Gensim most_similar with positive and negative: how does it work?

I was reading this answer, which says of Gensim's most_similar:

it performs vector arithmetic: adding the positive vectors, subtracting the negative, then from that resulting position, listing the known-vectors closest to that angle.

But when I tested it, that is not what I observed. I trained a Word2Vec model on Gensim's "text8" dataset and ran these two calls:

model.most_similar(positive=['woman', 'king'], negative=['man'])

>>> [('queen', 0.7131118178367615), ('prince', 0.6359186768531799),...]

model.wv.most_similar([model["king"] + model["woman"] - model["man"]])

>>> [('king', 0.84305739402771), ('queen', 0.7326322793960571),...]

They are clearly not the same. Even the score for queen differs: 0.713 in the first call and 0.732 in the second.

So I ask again: how does Gensim's most_similar work? Why are the two results above different?

The adding and subtracting isn't all that it does; for an exact description, you should look at the source code:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#LC690:~:text=def%20most_similar,self%2C

You'll see there that the addition and subtraction are performed on the unit-normed version of each vector, obtained via the get_vector(key, use_norm=True) accessor.

If you change your use of model[key] to model.get_vector(key, use_norm=True), your outside-the-method calculation of the target vector should give the same results as letting the method combine the positive and negative vectors.
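To see why unit-norming each vector before combining changes the scores, here is a minimal sketch in plain Python that does not use Gensim at all. The 2-D vectors are hypothetical toy values chosen to have different lengths, and the helpers unit, cosine and combine are illustrative names, not Gensim functions. Note also that, depending on your Gensim version, the keyword on get_vector may be spelled norm rather than use_norm, so check the signature in your installed release.

```python
import math

# Hypothetical 2-D toy vectors with deliberately different lengths.
# These are illustrative values, not real Word2Vec output.
vectors = {
    "king":  [4.0, 3.0],
    "man":   [1.0, 0.0],
    "woman": [0.0, 1.0],
    "queen": [1.0, 2.0],
}

def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def combine(positive, negative, normalize):
    """Add the positive vectors and subtract the negative ones.

    With normalize=True each vector is unit-normed first, which is the
    behaviour the answer above attributes to most_similar.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for sign, keys in ((1.0, positive), (-1.0, negative)):
        for key in keys:
            v = unit(vectors[key]) if normalize else vectors[key]
            total = [t + sign * x for t, x in zip(total, v)]
    return total

# Raw arithmetic, analogous to model["king"] + model["woman"] - model["man"].
raw_target = combine(["king", "woman"], ["man"], normalize=False)
# Arithmetic on unit-normed vectors.
normed_target = combine(["king", "woman"], ["man"], normalize=True)

raw_score = cosine(raw_target, vectors["queen"])
normed_score = cosine(normed_target, vectors["queen"])
print(raw_score, normed_score)  # the two scores differ
```

Because the raw vectors have different lengths, the longer ones dominate the unnormalized sum, so the two target directions (and hence the cosine scores against queen) differ. Separately, as far as I can tell from the source, most_similar also leaves the input words themselves out of its results when they are passed as keys, which would explain why king tops the list only when you pass a precomputed vector.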
