Cosine similarity between two ndarrays

I have two numpy arrays: the first has shape 100*4*200 and the second has shape 150*6*200. That is, array 1 stores 100 samples, each with 4 fields represented as 200-dimensional vectors, and array 2 stores 150 samples, each with 6 fields represented as 200-dimensional vectors.

Now I want to compute a similarity vector for each pair of samples and build a similarity matrix. For each pair of samples, I would like to calculate the similarity between every combination of fields and store the results so that I get a 15000*24 array.

The first 150 rows will be the similarity vectors between the 1st row of array 1 and the 150 rows of array 2, the next 150 rows will be the similarity vectors between the 2nd row of array 1 and the 150 rows of array 2, and so on. Each similarity vector has length (# fields in array 1) * (# fields in array 2), i.e. the 1st element is the cosine similarity between field 1 of array 1 and field 1 of array 2, the 2nd element is the similarity between field 1 of array 1 and field 2 of array 2, and so on, with the last element being the similarity between the last field of array 1 and the last field of array 2.

What is the best way to do this using numpy arrays?

So every "row" (I assume you mean the first axis, which I'll call axis 0) is the sample axis. That means the first array holds 100 samples, each of shape fields x dimensions = 4 x 200.

Taken literally, your description pairs the first row of the first array, which has shape (4, 200), with the whole second array, which has shape (150, 6, 200). A cosine similarity between a 2-D array and a 3-D array does not make sense (the closest thing to a dot product here would be a tensor product, which I'm fairly sure is not what you want).

So we have to extract one sample from each array at a time and then iterate over all pairs of samples.
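To make the pairing concrete, here is a minimal sketch of how one pair of samples maps to a row and column of the output described in the question. The names `n_a`, `f_a`, `n_b`, `f_b`, `row` and `col` are only illustrative and do not come from the question; indices are 0-based.

```python
import numpy as np

n_a, f_a, n_b, f_b, dim = 100, 4, 150, 6, 200
a = np.random.rand(n_a, f_a, dim)
b = np.random.rand(n_b, f_b, dim)

i, j = 1, 0                 # second sample of a, first sample of b
sample_a = a[i]             # one sample of a, shape (4, 200)
sample_b = b[j]             # one sample of b, shape (6, 200)

# Row of the output holding the similarity vector for the pair (i, j):
# all 150 pairs for sample 0 of a come first, then the pairs for sample 1, ...
row = i * n_b + j

# Column holding the similarity between field p of a and field q of b.
p, q = 3, 5                 # last field of a, last field of b
col = p * f_b + q
```

With these values, `row` is 150 (the first row of the second block of 150) and `col` is 23 (the last of the 24 columns).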

To do this I actually recommend just splitting the arrays with np.split and iterating over both of them. This is only because I've never come across a faster way in numpy. You could use tensorflow to gain efficiency, but I'm not going into that in this answer.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)
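# We know the output will be 100*150 x 4*6
c = np.empty([15000, 24])

# Split a and b along axis 0 into lists of single-sample arrays,
# each of shape (1, fields, 200)
a_splitted = np.split(a, a.shape[0], 0)
b_splitted = np.split(b, b.shape[0], 0)
i = 0
for alpha in a_splitted:
    for beta in b_splitted:
        # alpha[0] is (4, 200) and beta[0] is (6, 200), so this gives a 4x6 matrix
        sim = cosine_similarity(alpha[0], beta[0])
        # Flatten row by row: field 1 of a vs. fields 1..6 of b, then field 2, ...
        c[i, :] = sim.ravel()
        i += 1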

For the similarity function above I just used what @StefanFalk suggested: sklearn.metrics.pairwise.cosine_similarity. If this similarity measure is not sufficient, you could write your own.
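As a rough sketch of what "writing your own" might look like, the function below computes the pairwise cosine similarity between the rows of two 2-D arrays in plain numpy. The name `cosine_sim_matrix` and the `eps` guard against zero-length vectors are my own assumptions, not part of sklearn.

```python
import numpy as np

def cosine_sim_matrix(x, y, eps=1e-12):
    """Pairwise cosine similarity between rows of x (m, d) and rows of y (n, d).

    Hypothetical helper for illustration; eps avoids division by zero
    when a row has zero length.
    """
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    y_norm = np.linalg.norm(y, axis=1, keepdims=True)
    x_unit = x / np.maximum(x_norm, eps)
    y_unit = y / np.maximum(y_norm, eps)
    # Dot products of unit vectors are cosine similarities; result is (m, n).
    return x_unit @ y_unit.T

x = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0, 2.0], [3.0, 0.0], [2.0, 2.0]])
sim = cosine_sim_matrix(x, y)   # shape (2, 3)
```

It takes the same kind of 2-D inputs as the call in the loop above, so it could be swapped in there if a different normalisation or weighting is needed.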

I am not at all claiming that this is the best way to do this in all of Python. I think the most efficient way is to do this symbolically using, as mentioned, tensorflow.
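For what it's worth, the same 15000 x 24 layout can probably also be produced without a Python loop. The following is only a sketch of one possible alternative, not the method used above: it normalises every field vector and then uses np.einsum. It assumes no field vector has zero length, and the names `a_unit`, `b_unit` and `c_vec` are illustrative.

```python
import numpy as np

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)

# Scale every 200-dimensional field vector to unit length
# (assumes no vector is all zeros).
a_unit = a / np.linalg.norm(a, axis=-1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=-1, keepdims=True)

# sim[i, j, p, q] = cosine similarity between field p of sample i in a
# and field q of sample j in b; shape (100, 150, 4, 6).
sim = np.einsum('ipd,jqd->ijpq', a_unit, b_unit)

# Collapse (i, j) into rows and (p, q) into columns: row i*150 + j, column p*6 + q.
c_vec = sim.reshape(a.shape[0] * b.shape[0], a.shape[1] * b.shape[1])
```

I have not benchmarked this against the loop, so treat any speed-up as something to measure rather than assume.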

Anyway, hope it helps!
