
Comparing NumPy Arrays for Similarity

I have a target NumPy array with shape (300,) and a set of candidate arrays also of shape (300,). These arrays are Word2Vec representations of words; I'm trying to find the candidate word that is most similar to the target word using their vector representations. What's the best way to find the candidate word that is most similar to the target word?

One way to do this is to sum up the absolute values of the element-wise differences between the target word and the candidate words, then select the candidate word with the lowest overall absolute difference. For example:

candidate_1_difference = np.subtract(target_vector, candidate_vector)
candidate_1_abs_difference = np.absolute(candidate_1_difference)
candidate_1_total_difference = np.sum(candidate_1_abs_difference)

Yet, this seems clunky and potentially wrong. What's a better way to do this?

Edit to include example vectors:

import numpy as np
import gensim

path = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'


def func1(path):
    # Limited to 50K words to reduce load time
    model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True, limit=50000)
    context = ['computer', 'mouse', 'keyboard']
    candidates = ['office', 'house', 'winter']
    vectors_to_sum = []
    for word in context:
        # A KeyedVectors object is indexed directly by word
        vectors_to_sum.append(model[word])
    # Sum element-wise (axis=0) so the target keeps shape (300,)
    target_vector = np.sum(vectors_to_sum, axis=0)

    # Look up the candidate's vector; candidates[0] itself is only a string
    candidate_vector = model[candidates[0]]
    candidate_1_difference = np.subtract(target_vector, candidate_vector)
    candidate_1_abs_difference = np.absolute(candidate_1_difference)
    candidate_1_total_difference = np.sum(candidate_1_abs_difference)
    return candidate_1_total_difference

What you have is basically correct. You are calculating the L1 norm of the difference, which is the sum of absolute differences. Another, more common option is the Euclidean norm, or L2 norm, which is the familiar distance measure: the square root of the sum of squared differences.

You can use numpy.linalg.norm to calculate the different norms. For vectors it calculates the L2 norm by default; pass ord=1 to get the L1 norm instead.

distance = np.linalg.norm(target_vector - candidate_vector)
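As a minimal sketch of the two norms side by side, the snippet below uses made-up 3-element vectors in place of the real 300-dimensional Word2Vec vectors (the values are assumptions chosen so the arithmetic is easy to check by hand):

```python
import numpy as np

# Hypothetical short vectors standing in for the 300-dimensional embeddings.
target_vector = np.array([1.0, 2.0, 3.0])
candidate_vector = np.array([4.0, 6.0, 3.0])

diff = target_vector - candidate_vector  # [-3., -4., 0.]

# L1 norm: sum of absolute differences (the approach from the question).
l1_distance = np.linalg.norm(diff, ord=1)

# L2 norm: Euclidean distance (the default when ord is omitted).
l2_distance = np.linalg.norm(diff)

print(l1_distance)  # 7.0
print(l2_distance)  # 5.0
```

The L1 result matches np.sum(np.absolute(diff)), so the three-step computation in the question can be collapsed into a single call.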

If you have one target vector and multiple candidate vectors stored in a list, the above still works, but you need to specify the axis for norm. You then get a vector of norms, one for each candidate vector.

For a list of candidate vectors:

distance = np.linalg.norm(target_vector - np.array(candidate_vector), axis=1)
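To pick the most similar candidate from that vector of norms, one option is np.argmin. The sketch below is illustrative only: the candidate words come from the question, but the vectors are made-up 3-element stand-ins rather than real embeddings, and the names candidate_vectors, best_index, and best_word are hypothetical.

```python
import numpy as np

candidates = ["office", "house", "winter"]

# Made-up vectors for illustration only; real Word2Vec vectors have shape (300,).
target_vector = np.array([1.0, 2.0, 3.0])
candidate_vectors = [
    np.array([1.0, 2.0, 4.0]),
    np.array([4.0, 6.0, 3.0]),
    np.array([1.0, 2.0, 13.0]),
]

# Broadcasting subtracts the target from every row; axis=1 gives one norm per row.
distances = np.linalg.norm(target_vector - np.array(candidate_vectors), axis=1)

# The smallest distance marks the closest candidate.
best_index = int(np.argmin(distances))
best_word = candidates[best_index]

print(distances)  # [ 1.  5. 10.]
print(best_word)  # office
```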


 