
How to calculate difference vector in word2vec

I have a binary word2vec file and I am using gensim to load it.

gensim has a function to get the similarity between two words, but no function to calculate and return the difference vector.

How can I take two vectors and get their difference vector?

I am also trying to use these difference vectors as features in document classification, by calculating the difference vector between each word and each class. Is this the right approach?

For example, if the classes are sport and politics:

sport = [0.4,0.456,45,...] #wordvector of class
politics = [0.23,0.56...] #wordvector of class

And my word is football

football = [0.2,0.6,0.45,...] #wordvector of football

I want to calculate the difference vector:

(sport - football) = [some vector] # this as a feature for classification

How can I take two vectors and get their difference vector?

Your intuition of simply subtracting the two vectors seems correct (source: https://blog.galvanize.com/add-and-subtract-words-like-vectors-with-word2vec-2/ ). You do not need TensorFlow for this: gensim returns word vectors as NumPy arrays, which can be subtracted directly.

I am also trying to use these difference vectors as features in document classification, by calculating the difference vector between each word and each class. Is this the right approach?

I don't know your goal, but I would look into training your own neural network to classify words or documents. The newer package flair may help with that: https://github.com/zalandoresearch/flair/issues/787
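
For reference, the feature described in the question (one difference vector per class for each word) could be sketched as follows. The vectors are made-up 3-dimensional stand-ins, and class_diff_features is a hypothetical helper, not a gensim function. Whether such features actually help a classifier would need to be checked on your own data.

```python
import numpy as np

# Hypothetical vectors standing in for lookups such as wv['sport'].
class_vectors = {
    "sport": np.array([0.4, 0.5, 0.1]),
    "politics": np.array([0.2, 0.1, 0.7]),
}
word_vector = np.array([0.2, 0.6, 0.4])  # stand-in for wv['football']


def class_diff_features(word_vec, class_vecs):
    """Concatenate (class - word) difference vectors, in sorted class-name order."""
    diffs = [class_vecs[name] - word_vec for name in sorted(class_vecs)]
    return np.concatenate(diffs)


features = class_diff_features(word_vector, class_vectors)
print(features.shape)  # one block of 3 values per class
```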

The vectors themselves support subtraction via the normal Python - operator, so if your loaded word-vectors are in the variable wv , it really is as simple as:

diff_vector = wv['sport'] - wv['football']
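
As a minimal sketch of what that subtraction does, the snippet below uses small made-up NumPy arrays in place of real embeddings (gensim returns word vectors as NumPy arrays, so the arithmetic is the same). The toy_vectors dictionary and its values are illustrative assumptions, not output from any model.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for wv['sport'] and wv['football'].
toy_vectors = {
    "sport": np.array([0.4, 0.5, 0.1]),
    "football": np.array([0.2, 0.6, 0.4]),
}

# Element-wise subtraction gives the difference vector.
diff_vector = toy_vectors["sport"] - toy_vectors["football"]

# The result has the same dimensionality as the inputs.
print(diff_vector.shape)
```

With a real model, wv would typically come from gensim.models.KeyedVectors.load_word2vec_format(path, binary=True), where path points at your binary file.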

You could then try to find other vectors closest to the new vector via:

wv.most_similar(positive=[diff_vector])

Because the common-case of analogy-solving requires a mixture of positive and negative vectors, the most_similar() method even lets you supply negative-examples, so you could also do the difference-and-most-similar in a single step:

wv.most_similar(positive=['sport'], negative=['football'])

(The results might be slightly different than the 1st approach, due to some different ordering of unit-normalization that happens inside most_similar() .)
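
To see why the two approaches can disagree, here is a small sketch with made-up 2-dimensional vectors. It compares the raw difference with the difference of unit-normalized vectors. This illustrates the effect of normalization order only and is not a reproduction of gensim's internal code.

```python
import numpy as np


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(unit(a), unit(b)))


# Hypothetical vectors with deliberately different lengths.
sport = np.array([3.0, 0.0])
football = np.array([0.0, 1.0])

raw_diff = sport - football                  # subtract the raw vectors
normed_diff = unit(sport) - unit(football)   # normalize each input first, then subtract

# A cosine below 1 means the two difference vectors point in different
# directions, so a nearest-neighbour search around each can rank words differently.
similarity = cosine(raw_diff, normed_diff)
print(round(similarity, 3))
```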
