Comparing NumPy Arrays for Similarity

I have a target NumPy array with shape (300,) and a set of candidate arrays, also of shape (300,). These arrays are Word2Vec representations of words. What's the best way to find the candidate word that is most similar to the target word using their vector representations?

One way to do this is to sum the absolute values of the element-wise differences between the target vector and each candidate vector, then select the candidate with the lowest total absolute difference. For example:

candidate_1_difference = np.subtract(target_vector, candidate_vector)
candidate_1_abs_difference = np.absolute(candidate_1_difference)
candidate_1_total_difference = np.sum(candidate_1_abs_difference)

Yet this seems clunky and potentially wrong. What's a better way to do this?

Edit to include example vectors:

import numpy as np
import gensim

path = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'


def func1(path):
    # Limited to 50K words to reduce load time
    model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True, limit=50000)
    context = ['computer', 'mouse', 'keyboard']
    candidates = ['office', 'house', 'winter']
    vectors_to_sum = []
    for word in context:
        # Index the KeyedVectors object directly (.wv belongs to full Word2Vec models)
        vectors_to_sum.append(model[word])
    # axis=0 sums element-wise across the word vectors, keeping shape (300,)
    target_vector = np.sum(vectors_to_sum, axis=0)

    # Look up the candidate's vector rather than using the word string itself
    candidate_vector = model[candidates[0]]
    candidate_1_difference = np.subtract(target_vector, candidate_vector)
    candidate_1_abs_difference = np.absolute(candidate_1_difference)
    candidate_1_total_difference = np.sum(candidate_1_abs_difference)
    return candidate_1_total_difference

What you have is basically correct. You are calculating the L1 norm of the difference, which is the sum of absolute differences. A more common option is the Euclidean norm, or L2 norm, of the difference: the familiar distance measure given by the square root of the sum of squared differences.

You can use numpy.linalg.norm to calculate the different norms; by default it calculates the L2 norm for vectors.

distance = np.linalg.norm(target_vector - candidate_vector)
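As a rough illustration of how the two norms relate, here is a small sketch. The three-element vectors are made up for readability and only stand in for real 300-dimensional word vectors; passing ord=1 to numpy.linalg.norm should give the same number as the sum-of-absolute-differences code in the question.

```python
import numpy as np

# Made-up short vectors standing in for (300,) Word2Vec vectors.
target_vector = np.array([1.0, 2.0, 3.0])
candidate_vector = np.array([2.0, 0.0, 3.0])

diff = target_vector - candidate_vector

# L1 norm: the sum of absolute differences, as computed in the question.
l1_distance = np.linalg.norm(diff, ord=1)

# L2 (Euclidean) norm: what norm() computes for a vector when ord is omitted.
l2_distance = np.linalg.norm(diff)

print(l1_distance)  # 3.0
print(l2_distance)  # square root of 5, about 2.236
```

Either distance can be used for ranking candidates; the two may order candidates differently, so which one suits word vectors better is a judgment call.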

If you have one target vector and multiple candidate vectors stored in a list, the above still works, but you need to specify the axis for norm. You then get a vector of norms, one for each candidate vector.

For a list of candidate vectors:

distance = np.linalg.norm(target_vector - np.array(candidate_vector), axis=1)
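To go from the vector of distances to the most similar word, one option is to take the index of the smallest distance with np.argmin. The sketch below assumes that approach; the helper name most_similar and the tiny three-element vectors are hypothetical and chosen only so the result is easy to check by eye.

```python
import numpy as np

def most_similar(target_vector, candidate_vectors, candidate_words):
    """Return the word whose vector has the smallest L2 distance to the target."""
    # Stack the list into an (n_candidates, n_dims) array.
    stacked = np.array(candidate_vectors)
    # Broadcasting subtracts each row from the target; axis=1 gives one norm per row.
    distances = np.linalg.norm(target_vector - stacked, axis=1)
    best = int(np.argmin(distances))
    return candidate_words[best], distances

# Made-up vectors; real ones would come from the Word2Vec model.
target = np.array([1.0, 0.0, 0.0])
words = ['office', 'house', 'winter']
vectors = [
    np.array([0.9, 0.1, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0]),
]

word, distances = most_similar(target, vectors, words)
print(word)  # office
```

Returning the distances alongside the word makes it possible to inspect how close the runner-up candidates were.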
