
Speeding up distance between all possible pairs in an array

I have an array of the x, y, z coordinates of several (~10^10) points (only 5 shown here):

a = np.array([[34.45, 14.13,  2.17],
              [32.38, 24.43, 23.12],
              [33.19,  3.28, 39.02],
              [36.34, 27.17, 31.61],
              [37.81, 29.17, 29.94]])

I want to make a new array with only those points which are at least some distance d away from all other points in the list. I wrote the following code using a while loop:

import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1

    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1

This works, but it is taking a really long time to perform this calculation. I read somewhere that while loops are very slow.

I was wondering if anyone has any suggestions on how to speed up this calculation.

EDIT: While my objective of finding the particles which are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Say I have 3 particles. For the first iteration of i, my code calculates the distances 1->2 and 1->3; suppose 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i it only does 2->3, and suppose it finds that this is greater than d, so it keeps particle 2. But this is wrong, since 2 should also be discarded along with particle 1. The solution by @svohara is the correct one!

For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the kd-tree.

The strategy is to index the data set. Then query the index using the same data set, to return the 2 nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist = 0), so we really want to know how far away the next closest point is (the 2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.

from scipy.spatial import cKDTree as KDTree
import numpy as np

#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')

# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
#  there are some parameters that could be tweaked for faster indexing,
#  and there are implementations (not in scipy) that can construct
#  the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)

#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4)  # to use more CPUs on query...

#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.

# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1   #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1]  #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
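If you want the selected coordinates themselves rather than their indices, a plain fancy index on the result gives them (this last step is my addition, not part of the answer above):

selected_points = a[res]   # rows of a whose nearest other point is at least thresh_d away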

Here's a vectorized approach using distance.pdist:

# Store number of pts (number of rows in a)
m = a.shape[0]

# Get the first of pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()

# Get the IDs of pairs of rows that are more than "d" apart and thus select 
# the rest of the rows using a boolean mask created with np.in1d for the 
# entire range of number of rows in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m),idx1[distance.pdist(a) < d])] 

For a huge dataset like 10^10 points, we might have to perform the operations in chunks based on the available system memory.
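As a rough sketch of that chunking idea (the function name, the chunk size, and the use of distance.cdist are my assumptions, not part of the answer above; only one block of the distance matrix is held in memory at a time, although the total work is still quadratic):

import numpy as np
from scipy.spatial import distance

def select_far_points_chunked(a, d, chunk=2000):
    # Keep only the points that are at least d away from every other point,
    # computing the pairwise distances one block of rows at a time so that
    # only a (chunk x N) slice of the full distance matrix exists in memory.
    m = a.shape[0]
    keep = np.ones(m, dtype=bool)
    for start in range(0, m, chunk):
        stop = min(start + chunk, m)
        # distances from this block of rows to all rows, shape (stop - start, m)
        dmat = distance.cdist(a[start:stop], a)
        # ignore each point's zero distance to itself
        dmat[np.arange(stop - start), np.arange(start, stop)] = np.inf
        keep[start:stop] &= (dmat >= d).all(axis=1)
    return a[keep]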

  1. Drop the append, it must be really slow. You can have a static vector of distances and use [] to put each number in the right position.

  2. Use min instead of all. You only need to check whether the minimum distance is bigger than d.

  3. Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then you can drop both points. That way you do not even have to save any distances (unless you need them later).

  4. Since d(a,b) = d(b,a), you can run the inner loop only over the following points and forget about the distances you have already calculated. If you need them later you can look them up in the array.

From your comment, I believe this would do, provided you have no repeated points:

selected_points = []
for p1 in a:
    save_point = True
    for p2 in a:
        # skip the comparison of a point with itself (assumes no repeated points)
        if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
            save_point = False
            break
    if save_point:
        selected_points.append(p1)

In the end I check both (a, b) and (b, a), because you should not modify a list while you are iterating over it, but you could be smarter with some additional variables.
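For illustration, here is a minimal sketch that combines point 4 above (visit each unordered pair only once) with the behaviour requested in the question's edit (both members of a too-close pair are discarded); the helper name is mine and a is assumed to be a numpy array:

import numpy as np
from scipy.spatial import distance

def isolated_points(a, d):
    # keep[i] stays True only if point i is at least d away from every other point
    keep = np.ones(len(a), dtype=bool)
    for i in range(len(a)):
        for j in range(i + 1, len(a)):       # d(a,b) == d(b,a): each pair visited once
            if distance.euclidean(a[i], a[j]) < d:
                keep[i] = False              # discard *both* points of a close pair
                keep[j] = False
    return a[keep]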

Your algorithm is quadratic (~10^20 operations). Here is a linear approach if the distribution is nearly random: split your space into cubic boxes of side d/sqrt(3) (volume (d/sqrt(3))^3) and put each point into its box.

Then, for each box:

  • if there is just one point, you only have to calculate distances to points in a small neighborhood of boxes (see the sketch after this list);

  • otherwise there is nothing to compute: any two points sharing a box are already within d of each other, so none of them can be kept.
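A minimal sketch of that binning idea (the function name, the dictionary of box -> point indices, and the neighborhood scan below are my own rough interpretation of the answer, not code from it):

import numpy as np
from collections import defaultdict
from scipy.spatial import distance

def grid_select(a, d):
    # Box side d/sqrt(3): any two points sharing a box are at most d apart,
    # so a box holding more than one point can be discarded without measuring.
    cell = d / np.sqrt(3)
    boxes = defaultdict(list)
    for idx, p in enumerate(a):
        boxes[tuple((p // cell).astype(int))].append(idx)

    keep = []
    for key, members in boxes.items():
        if len(members) > 1:
            continue          # box-mates are within d of each other: drop them all
        i = members[0]
        ok = True
        # With this box size, a neighbor closer than d can sit up to two boxes
        # away along each axis, so scan the surrounding 5x5x5 block of boxes.
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                for dz in range(-2, 3):
                    if dx == dy == dz == 0:
                        continue
                    for j in boxes.get((key[0] + dx, key[1] + dy, key[2] + dz), []):
                        if distance.euclidean(a[i], a[j]) < d:
                            ok = False
        if ok:
            keep.append(i)
    return a[keep]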
