
Pythonic way to calculate distance using numpy matrices?

I have a list of points in a numpy matrix,

A = [[x11,x12,x13],[x21,x22,x23] ]

and I have an origin point o = [o1,o2,o3] from which I have to compute the distance to every point.

A - o will subtract o from every point. Currently I have to square every attribute and add the results up, which I am doing in a for loop. Is there a more intuitive way to do this?

PS: I am doing the above calculation as part of a kmeans clustering application. I have computed the centroids and now I have to compute the distance of every point from each centroid.

input_mat = input_data_per_minute.values[:,2:5]

scaled_input_mat = scale2(input_mat)

k_means = cluster.KMeans(n_clusters=5)

print('training start')
k_means.fit(scaled_input_mat)
print('training over')

out = k_means.cluster_centers_

I have to compute the distance between input_mat and each cluster centroid.

Numpy solution:

Numpy is great with broadcasting, so you can get it to compute all distances in one step. But this can consume a lot of memory, depending on the number of points and cluster centers: it creates an intermediate array of shape number_of_points * number_of_cluster_centers * 3.

First you need to know a bit about broadcasting. I'll play it safe and define each dimension by hand.

I'll start by defining some points and centers for illustration purposes:

import numpy as np

points = np.array([[1,1,1],
                   [2,1,1],
                   [1,2,1],
                   [5,5,5]])

centers = np.array([[1.5, 1.5, 1],
                    [5,5,5]])

Now I'll prepare these arrays so that I can use numpy broadcasting to get the difference along each dimension:

distance_3d = points[:,None,:] - centers[None,:,:]

Effectively, the first dimension is now the point "label", the second dimension is the center "label", and the third dimension is the coordinate. The subtraction gives the signed difference along each coordinate for every point/center pair. The result has the shape:

(number_of_points, number_of_cluster_centers, 3)
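To make the shapes concrete, here is a minimal sketch using the same sample points and centers as above. The names points_3d and centers_3d are introduced only for illustration and are not part of the original answer.

```python
import numpy as np

points = np.array([[1, 1, 1],
                   [2, 1, 1],
                   [1, 2, 1],
                   [5, 5, 5]])

centers = np.array([[1.5, 1.5, 1],
                    [5, 5, 5]])

# Indexing with None inserts a length-1 axis (same as np.newaxis).
points_3d = points[:, None, :]    # shape (4, 1, 3)
centers_3d = centers[None, :, :]  # shape (1, 2, 3)

# Broadcasting stretches the length-1 axes, so every point is
# paired with every center before the subtraction.
distance_3d = points_3d - centers_3d

print(points_3d.shape, centers_3d.shape, distance_3d.shape)
# (4, 1, 3) (1, 2, 3) (4, 2, 3)
```

Here distance_3d[i, j] holds the coordinate-wise difference between point i and center j.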

Now it's only a matter of applying the formula for the Euclidean distance:

# Square each coordinate difference
distance_3d_squared = distance_3d ** 2

# Sum the squared differences over the coordinate axis (the result will be 2D)
distance_sum = np.sum(distance_3d_squared, axis=2)

# And take the square root
distance = np.sqrt(distance_sum)

For my test data the final result is:

#array([[ 0.70710678,  6.92820323],
#       [ 0.70710678,  6.40312424],
#       [ 0.70710678,  6.40312424],
#       [ 6.36396103,  0.        ]])

So the distance[i, j] element gives you the distance from point i to center j.
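Since the question arises from k-means, a natural follow-up is to read off the closest center for each point from this matrix. The following is only a sketch on the sample data; nearest_center and nearest_distance are hypothetical names, not something from the original answer.

```python
import numpy as np

points = np.array([[1, 1, 1],
                   [2, 1, 1],
                   [1, 2, 1],
                   [5, 5, 5]])

centers = np.array([[1.5, 1.5, 1],
                    [5, 5, 5]])

# Same computation as above: distance[i, j] is point i to center j.
distance = np.sqrt(np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2))

# Index of the closest center for every point (minimum along the center axis).
nearest_center = np.argmin(distance, axis=1)

# Distance from each point to its own closest center.
nearest_distance = distance[np.arange(len(points)), nearest_center]

print(nearest_center)  # [0 0 0 1]
```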

Summary:

You can put all of this in one line:

distance2 = np.sqrt(np.sum((points[:,None,:] - centers[None,:,:]) ** 2, axis=2))
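If the intermediate 3D array is too large for memory, one possible workaround is to process the points in chunks so that only a slice of it exists at any time. This is a rough sketch rather than part of the original answer; the function name chunked_distances and the chunk_size parameter are assumptions made for illustration.

```python
import numpy as np

def chunked_distances(points, centers, chunk_size=1000):
    """Euclidean distance from every point to every center, computed chunk by chunk."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    out = np.empty((len(points), len(centers)))

    for start in range(0, len(points), chunk_size):
        block = points[start:start + chunk_size]
        # Only a (chunk, n_centers, n_coords) array is alive here.
        diff = block[:, None, :] - centers[None, :, :]
        out[start:start + chunk_size] = np.sqrt(np.sum(diff ** 2, axis=2))

    return out

points = np.array([[1, 1, 1], [2, 1, 1], [1, 2, 1], [5, 5, 5]])
centers = np.array([[1.5, 1.5, 1], [5, 5, 5]])

# A tiny chunk size, just to exercise the loop on the sample data.
distance_chunked = chunked_distances(points, centers, chunk_size=3)
```

The result should match the one-line version; the trade-off is a Python-level loop over chunks in exchange for a bounded peak memory footprint.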

Scipy solution (faster & shorter):

Or, if you have scipy, use cdist:

from scipy.spatial.distance import cdist
distance3 = cdist(points, centers)

The result will always be the same, but cdist is the fastest when there are many points and centers.

You should be able to do something like this (assuming I read your question right ;) ):

In [1]: import numpy as np

In [2]: a = np.array([[11,12,13],[21,22,23]])

In [3]: o = [1,2,3]

In [4]: a - o  # just showing
Out[4]: 
array([[10, 10, 10],
       [20, 20, 20]])

In [5]: a ** 2  # just showing
Out[5]: 
array([[121, 144, 169],
       [441, 484, 529]])

In [6]: b = (a ** 2) + (a - o)

In [7]: b
Out[7]: 
array([[131, 154, 179],
       [461, 504, 549]])

Numpy is great because it moves through the array element-wise! This means that 90+% of the time you can iterate over the array without a for-loop. Using a for-loop outside of the array is also significantly slower.
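Note that b in the session above combines squares and differences element-wise, but it is not itself a distance. If the goal is the Euclidean distance from each row of a to the origin o, as in the question, the same element-wise style can be sketched as follows. The names diff, dist and dist_norm are introduced here for illustration only.

```python
import numpy as np

a = np.array([[11, 12, 13],
              [21, 22, 23]])
o = np.array([1, 2, 3])

# Subtract the origin from every row (broadcast over rows).
diff = a - o

# Square, sum over the coordinate axis, then take the square root.
dist = np.sqrt(np.sum(diff ** 2, axis=1))

# np.linalg.norm with axis=1 computes the same per-row vector norm.
dist_norm = np.linalg.norm(diff, axis=1)

print(dist)  # approximately [17.32 34.64]
```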


 