Find cosine distance for all pairs of word2vec encodings without using nested loops

I need to calculate and store cosine distances for all pairs of words in a word2vec encoding. Each word is represented as a 4 × 1 vector stored in a pandas DataFrame, with each element in the continuous range [1, 9]. I need to store the result in a pandas DataFrame so that any pair's distance can be accessed in constant time.

I have been unable to use the pandas apply function (with a lambda) for this. Using nested loops would take approximately 9 hours (according to tqdm).

word     word1    word2    word3 ...
word1    d11      d12      d13...
word2    d21      d22      d23...
word3    d31      d32      d33...
.
.
.

If you use something like the Python gensim library to load a pre-existing vector set (in the original word2vec.c format) into its KeyedVectors representation, the raw vectors will be available as a numpy array in its vectors property. For example:

from gensim.models import KeyedVectors
kv = KeyedVectors.load_word2vec_format('word_vectors.bin', binary=True)
print(kv.vectors.shape)  # (number of words, vector dimensions)

You could then use a library function like scikit-learn's pairwise_distances() to compute the distance matrix:

from sklearn.metrics import pairwise_distances
distances = pairwise_distances(kv.vectors, metric="cosine")

Because the sklearn routine uses optimized native math routines, it will likely be a lot faster than your initial loops-in-pure-Python approach. Note, though, that the resulting distances matrix may be huge: for a vocabulary of N words it holds N × N values.
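To get a feel for how quickly the distance matrix grows, here is a minimal sketch of the arithmetic. The helper name distance_matrix_bytes is made up for illustration, and the default of 8 bytes per entry assumes 64-bit floats; the actual dtype depends on the input array.

```python
def distance_matrix_bytes(n_words, bytes_per_entry=8):
    """Estimate the memory needed for a dense n_words x n_words matrix.

    bytes_per_entry=8 is an assumption (float64); use 4 for float32.
    """
    return n_words * n_words * bytes_per_entry


# Print the estimate for a few hypothetical vocabulary sizes.
for n in (1_000, 10_000, 100_000):
    gib = distance_matrix_bytes(n) / 2**30
    print(f"{n:>7} words -> {gib:,.2f} GiB")
```

Under that assumption, a 100,000-word vocabulary already needs 80,000,000,000 bytes (about 80 GB), so it is worth checking the vocabulary size before computing the full matrix.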

(You can find out which words are in which kv.vectors slots via the list in kv.index2entity, or look up the slot for a word via the dict in kv.vocab. In gensim 4.0 and later, these were renamed to kv.index_to_key and kv.key_to_index.)
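Since the question asks for the result as a pandas DataFrame, here is a minimal sketch of how the distance matrix could be labelled by word. The word list and the random 4-dimensional vectors are made-up stand-ins; with gensim you would pass kv.vectors and the matching word list instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical toy data: 5 made-up words, each a 4-dimensional vector in [1, 9].
words = ["alpha", "beta", "gamma", "delta", "epsilon"]
rng = np.random.default_rng(0)
vectors = rng.uniform(1.0, 9.0, size=(len(words), 4))

# One call computes every pairwise cosine distance (1 - cosine similarity).
distances = pairwise_distances(vectors, metric="cosine")

# Label both axes with the words so lookups go by word rather than slot number.
distance_df = pd.DataFrame(distances, index=words, columns=words)

# Label-based scalar lookup of a single pair.
print(distance_df.at["alpha", "gamma"])
```

The .at accessor looks up one cell by its row and column labels, which avoids scanning the table for each query. The DataFrame is still dense, so the memory caveat above applies to it as well.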
