Find cosine distance for all pairs of word2vec encodings without using nested loops
I need to calculate and store cosine distances for all pairs of words in a word2vec encoding. Each word is represented as a 4 * 1 vector stored in a pandas DataFrame, with each element in the continuous range [1, 9]. I need to store the result in a pandas DataFrame so that the distance for any pair can be accessed in constant time.
I am unable to use the apply function of the pandas library with a lambda. Using nested loops would take approximately 9 hours (according to tqdm).
word word1 word2 word3 ...
word1 d11 d12 d13...
word2 d21 d22 d23...
word3 d31 d32 d33...
.
.
.
If you use something like the Python gensim library to load a pre-existing vector set (in the original word2vec.c format) into its KeyedVectors representation, the raw vectors will be in a numpy array in its vectors property. For example:
from gensim.models import KeyedVectors
kv = KeyedVectors.load_word2vec_format('word_vectors.bin', binary=True)
print(kv.vectors.shape)  # (number of words, vector size)
You could then use a library function like scikit-learn's pairwise_distances() to compute the distance matrix:
from sklearn.metrics import pairwise_distances
distances = pairwise_distances(kv.vectors, metric="cosine")
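For intuition about what the cosine metric computes, here is a minimal numpy-only sketch. It is not how scikit-learn implements it internally; the function name cosine_distance_matrix and the toy array are made up for illustration, and the code assumes no row is all zeros (which holds for values in [1, 9]).

```python
import numpy as np

def cosine_distance_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the matrix of 1 - cosine similarity for every pair of rows."""
    # Scale each row to unit length (assumes no all-zero rows).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    # Dot products of unit vectors are cosine similarities.
    similarity = unit @ unit.T
    # Clip to guard against floating-point overshoot outside [-1, 1].
    return 1.0 - np.clip(similarity, -1.0, 1.0)

# Toy 4-dimensional vectors standing in for real word vectors.
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same direction as the first row
    [9.0, 1.0, 1.0, 1.0],
])
distances = cosine_distance_matrix(toy)
print(distances.round(3))
```

The single matrix product replaces both nested loops, which is where the speedup over a pure-Python approach comes from.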
Because the sklearn routine uses optimized native math routines, it will likely be much faster than your initial loops-in-pure-Python approach. Note, though, that the resulting distances matrix may be huge: it holds one value for every pair of words, so its size grows with the square of the vocabulary size.
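To get a rough sense of how large the matrix could be before computing it, a back-of-the-envelope estimate like the following may help. The helper name distance_matrix_bytes is hypothetical, and the default of 8 bytes per value assumes float64 output; use 4 if the result is float32.

```python
def distance_matrix_bytes(n_words: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_words x n_words matrix."""
    return n_words * n_words * bytes_per_value

# Vocabulary sizes chosen only for illustration.
for n in (10_000, 100_000, 1_000_000):
    gib = distance_matrix_bytes(n) / 2**30
    print(f"{n:>9,} words -> {gib:,.1f} GiB")
```

At 10,000 words the dense float64 matrix already needs roughly 0.75 GiB, and each tenfold increase in vocabulary multiplies that by 100.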
(You can find out which words are in which kv.vectors slots via the list in kv.index2entity, or look up the slot for a word via the dict in kv.vocab. In gensim 4.0 and later, these are named kv.index_to_key and kv.key_to_index.)
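Since the question asks for the result in a pandas DataFrame, one possible way to attach the word labels is sketched below. The words list and vectors array are toy stand-ins for the word list and kv.vectors from a loaded model, not real data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical stand-ins for the model's word list and kv.vectors.
words = ["word1", "word2", "word3"]
vectors = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [9.0, 1.0, 1.0, 1.0],
])

# Compute all pairwise cosine distances in one call.
distances = pairwise_distances(vectors, metric="cosine")

# Label both rows and columns with the words, in slot order.
table = pd.DataFrame(distances, index=words, columns=words)

# .at performs a label-based lookup of a single value.
d13 = table.at["word1", "word3"]
print(table.round(3))
```

With a real model, the same pattern should apply with the word list and kv.vectors substituted in, provided the full matrix fits in memory.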