
Gensim: find vectors/words in a ball of radius r

I would like to take a word, say "book", get its vector representation v_1, and find all words whose vector representation v_i lies within a ball of radius r around v_1, i.e. ||v_1 - v_i|| <= r for some real number r.

I know gensim has a most_similar function, which lets you specify how many of the top vectors to return, but that is not quite what I need. I could certainly use a brute-force search to get the answer, but it would be too slow.

If you call most_similar() with topn=0, it will return the raw, unsorted cosine similarities to all other words known to the model. (These similarities will not be in tuples with the words; they are simply in the same order as the words in the index2entity property.)

You could then filter for the similarities higher than your preferred threshold and return just those indexes/words, using a function like numpy's argwhere.

For example:

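import numpy as np

# wv is assumed to be an already-loaded gensim KeyedVectors (e.g. model.wv)
target_word = 'apple'
threshold = 0.9
# topn=0 returns the raw similarity array, ordered like wv.index2entity
all_sims = wv.most_similar(target_word, topn=0)
# flatnonzero yields flat integer indexes; argwhere would yield one row per match
satisfactory_indexes = np.flatnonzero(all_sims > threshold)
satisfactory_words = [wv.index2entity[i] for i in satisfactory_indexes]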
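Note that most_similar() scores by cosine similarity, whereas the question asks for a Euclidean ball. A minimal sketch of the Euclidean version follows, using plain numpy on a made-up toy vocabulary. The names words_in_ball and cosine_threshold_for_radius are hypothetical helpers, and the arrays stand in for what gensim exposes as wv.vectors and its word index. It is still a full scan of the vocabulary, but a vectorized one, which is usually fast enough for vocabularies that fit in memory.

```python
import numpy as np


def words_in_ball(vectors, words, center_word, r):
    """Return (word, distance) pairs with ||v_center - v_i|| <= r, nearest first."""
    center = vectors[words.index(center_word)]
    # One vectorized pass: Euclidean distance from every row to the center.
    dists = np.linalg.norm(vectors - center, axis=1)
    inside = np.flatnonzero(dists <= r)
    # Order the hits by increasing distance.
    inside = inside[np.argsort(dists[inside])]
    return [(words[i], float(dists[i])) for i in inside]


def cosine_threshold_for_radius(r):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b),
    so ||a - b|| <= r is equivalent to cos(a, b) >= 1 - r^2 / 2."""
    return 1.0 - (r * r) / 2.0


# Toy data, invented for illustration only.
words = ["book", "novel", "paper", "banana"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.5, 0.5],
    [-1.0, 0.0],
])

result = words_in_ball(vectors, words, "book", r=0.75)
print(result)
```

The center word itself is included in the result at distance 0; drop the first entry if that is not wanted. If the vectors are unit-normalized, cosine_threshold_for_radius(r) converts the radius into a cosine threshold that can be used with the similarity array from the answer. As far as I know, in gensim 4.x the index2entity property became index_to_key and the full similarity array is requested with topn=None rather than topn=0, so check the documentation for your installed version.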


 