
How to find pairs of values greater than a certain cosine distance value?

I have an array:

[[ 0.32730174 -0.1436172  -0.3355202  -0.2982458 ]
 [ 0.50490916 -0.33826587  0.4315952   0.4850834 ]
 [-0.18594801 -0.06028342 -0.24817085 -0.41029227]
 [-0.22551994  0.47151482 -0.39798814 -0.14978702]
 [-0.3315491   0.05832376 -0.29526958  0.3786153 ]]

I computed the pairwise values with pdist as cosine_distance = 1 - pdist(array, metric='cosine') (strictly speaking, 1 minus the cosine distance is the cosine similarity) and got this array:

[-0.14822659  0.51635946  0.09485546 -0.38855427 -0.82434624 -0.86407176
 -0.25101774  0.49793639 -0.07881047  0.41272145]

Now I want only those pairs whose cosine distance is greater than 0.4 and less than 0.49. I can count the values that are at least 0.4 with number_points = len([1 for i in cosine_distance if i >= 0.4]), but I cannot work out how to get the pairs themselves.

The trick is in the description of the output in the pdist documentation:

Y : ndarray

Returns a condensed distance matrix Y. For each i and j (where i < j < m), where m is the number of original observations. The metric dist(u=X[i], v=X[j]) is computed and stored in entry ij.

The documentation also refers to squareform, which turns the condensed distance vector back into a square matrix. With that, the description of the output makes sense: entry ij is the element at row i, column j of the matrix produced by squareform. We can then look up the distance for every pair of points.

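from scipy.spatial.distance import squareform

# cosine_distance is the condensed vector from the question
distance_matrix = squareform(cosine_distance)
points = array  # the original observations from the question
points_to_keep = []

# Visit each pair once (i < j); the matrix is symmetric
for i in range(len(points) - 1):
    for j in range(i + 1, len(points)):
        # Add an upper bound here (e.g. "and distance_matrix[i, j] < 0.49") if needed
        if distance_matrix[i, j] > 0.4:
            points_to_keep.append((points[i], points[j]))

print(points_to_keep)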
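
For larger inputs, the double loop above could probably be replaced with NumPy masking. The following is only a sketch under assumptions: it hardcodes the condensed values from the question, applies both bounds (0.4 and 0.49) as strict inequalities, and reports row-index pairs rather than the rows themselves. The names in_range and pairs are illustrative, not part of the original answer.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Condensed values copied from the question (10 values for 5 rows).
cosine_distance = np.array([
    -0.14822659, 0.51635946, 0.09485546, -0.38855427, -0.82434624,
    -0.86407176, -0.25101774, 0.49793639, -0.07881047, 0.41272145,
])

# Expand the condensed vector into a symmetric 5x5 matrix (zeros on the diagonal).
distance_matrix = squareform(cosine_distance)

# True where the value lies strictly between the two bounds.
in_range = (distance_matrix > 0.4) & (distance_matrix < 0.49)

# Keep only the upper triangle (i < j) so each pair is reported once.
rows, cols = np.nonzero(np.triu(in_range, k=1))
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)  # [(3, 4)]
```

With the question's data only rows 3 and 4 qualify (value 0.41272145); the value 0.49793639 for rows 2 and 3 falls just above the upper bound.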

Why not just add the upper bound to the condition?

number_points=len([1 for i in cosine_distance if i >= 0.4 and i <= 0.49])

If you need to keep track of which pairs are in the range, use enumerate:

number_points = [idx for idx, i in enumerate(cosine_distance) if i >= 0.4 and i <= 0.49]

This gives you a list of the indexes into cosine_distance of the pairs that satisfy the condition.
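
Those indexes refer to positions in the condensed vector, not to rows of the original array. One way to translate them, sketched here with the question's values hardcoded, relies on the fact that pdist orders its output as (0, 1), (0, 2), ..., (3, 4), which is the same order itertools.combinations produces. The names m, index_pairs and selected are illustrative assumptions.

```python
from itertools import combinations

# Condensed values copied from the question (10 values for 5 rows).
cosine_distance = [
    -0.14822659, 0.51635946, 0.09485546, -0.38855427, -0.82434624,
    -0.86407176, -0.25101774, 0.49793639, -0.07881047, 0.41272145,
]
m = 5  # number of rows in the original array

# Position k in the condensed vector corresponds to index_pairs[k].
index_pairs = list(combinations(range(m), 2))

# Same condition as in the answer, but collecting (i, j) row indexes.
selected = [
    index_pairs[idx]
    for idx, value in enumerate(cosine_distance)
    if 0.4 <= value <= 0.49
]
print(selected)  # [(3, 4)]
```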
