
Gensim find vectors/words in ball of radius r

I would like to take a word, "book" for example, get its vector representation v_1, and find all words whose vector representation lies within a ball of radius r around v_1, i.e. ||v_1 - v_i|| <= r for some real number r.

I know gensim has a most_similar function, which lets you specify the number of top vectors to return, but that is not quite what I need. I could certainly use a brute-force search to get the answer, but it would be too slow.

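If you call most_similar() with topn=0, it will return the raw, unsorted cosine similarities to all words known to the model. (These similarities will not be in tuples with the words, but simply in the same order as the words in the index2entity property.) Note that this describes older gensim releases; in gensim 4.x, the raw array is requested with topn=None, and the word list is the index_to_key property.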

You could then filter those similarities for those higher than your preferred threshold, and return just those indexes/words, using a function like numpy's argwhere.

For example:

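import numpy as np

target_word = 'apple'
threshold = 0.9
# wv is an already-loaded KeyedVectors instance; on gensim 4.x use topn=None and wv.index_to_key
all_sims = wv.most_similar(target_word, topn=0)  # cosine similarity to every word, in index order
# argwhere returns an (n, 1) array, so flatten it to get plain integer indexes
satisfactory_indexes = np.argwhere(all_sims > threshold).ravel()
satisfactory_words = [wv.index2entity[i] for i in satisfactory_indexes]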
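The threshold above is on cosine similarity, while the question asks for a Euclidean ball. For unit-length vectors the two are related by ||u - v||^2 = 2 - 2*cos(u, v), so a radius r corresponds to a cosine threshold of 1 - r^2/2. The sketch below shows that conversion, plus a direct Euclidean filter for vectors that are not normalized. It is only a rough illustration: the function names and the toy words and vectors are made up here, and with a real model the arrays would presumably come from something like wv.vectors and wv.index_to_key (gensim 4.x). It is still a linear scan, but done as a single vectorized NumPy operation rather than a Python loop.

```python
import numpy as np

def cosine_threshold_for_radius(r):
    """Cosine similarity equivalent to Euclidean radius r (unit-length vectors only)."""
    return 1.0 - (r ** 2) / 2.0

def words_in_ball(words, vectors, target_word, r):
    """Return (word, distance) pairs with ||v_target - v_i|| <= r, nearest first."""
    target = vectors[words.index(target_word)]
    # One vectorized pass: Euclidean distance from the target to every row.
    dists = np.linalg.norm(vectors - target, axis=1)
    inside = np.flatnonzero(dists <= r)
    # Order the surviving indexes by distance.
    inside = inside[np.argsort(dists[inside])]
    return [(words[i], float(dists[i])) for i in inside]

# Toy data standing in for a real model (hypothetical values).
words = ["book", "novel", "paper", "car"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.6, 0.5],
    [-1.0, 0.2],
])

result = words_in_ball(words, vectors, "book", r=0.7)
print(result)
```

The target word itself is always returned, at distance 0, so drop the first entry if only neighbours are wanted.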
