
Intersection between two multi-dimensional arrays with tolerance - NumPy / Python

I am stuck on a problem. I have two 2-D NumPy arrays filled with x and y coordinates. The arrays might look like this:

array1([[(1.22, 5.64)],
   [(2.31, 7.63)],
   [(4.94, 4.15)]],

array2([[(1.23, 5.63)],
   [(6.31, 10.63)],
   [(2.32, 7.65)]],

Now I have to find "duplicate nodes". However, I also have to treat nodes as equal when their coordinates agree within a given tolerance, so I can't use solutions like this. Since my arrays are quite big (about 200,000 rows each), two simple for loops are not an option either. My final output should look like this:

output([[(1.23, 5.63)],
   [(2.32, 7.65)]],

I would appreciate some hints.

Cheers,

To compare two nodes with a given tolerance, I recommend numpy.isclose(), where you can set a relative and an absolute tolerance.

numpy.isclose(1.24, 1.25, atol=1e-1)
# True
numpy.isclose([1.24, 2.31], [1.25, 2.32], atol=1e-1)
# array([ True,  True])
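
As a quick check of how the two tolerances combine: the NumPy documentation gives the test for finite values as |a - b| <= atol + rtol * |b|. The sketch below reproduces that rule by hand and compares it with numpy.isclose(); the sample values are made up for illustration.

```python
import numpy as np

# Made-up sample values, not taken from the question
a = np.array([1.24, 2.31, 100.0])
b = np.array([1.25, 2.50, 100.0005])

rtol, atol = 1e-5, 1e-1

# Documented rule for finite values: |a - b| <= atol + rtol * |b|
manual = np.abs(a - b) <= atol + rtol * np.abs(b)
builtin = np.isclose(a, b, rtol=rtol, atol=atol)

print(manual)   # [ True False  True]
print(builtin)  # [ True False  True]
```

Note that rtol scales with the magnitude of the second argument, so for coordinates with a fixed tolerance it is usually atol that matters.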

Instead of writing two nested for loops, you can use the itertools.product() function to go through all pairs. The following code does what you want:

import itertools
import numpy as np

array1 = np.array([[1.22, 5.64],
                   [2.31, 7.63],
                   [4.94, 4.15]])

array2 = np.array([[1.23, 5.63],
                   [6.31, 10.63],
                   [2.32, 7.64]])

output = np.empty((0,2))
for i0, i1 in itertools.product(np.arange(array1.shape[0]),
                                np.arange(array2.shape[0])):
    if np.all(np.isclose(array1[i0], array2[i1], atol=1e-1)):
         output = np.concatenate((output, [array2[i1]]), axis=0)
# output = [[ 1.23  5.63]
#           [ 2.32  7.64]]

Define an isclose function similar to numpy.isclose, but a bit faster (mostly because it does not check its input and supports only an absolute tolerance, not a relative one):

import numpy as np

array1 = np.array([[(1.22, 5.64)],
                   [(2.31, 7.63)],
                   [(4.94, 4.15)]])

array2 = np.array([[(1.23, 5.63)],
                    [(6.31, 10.63)],
                    [(2.32, 7.65)]])

def isclose(x, y, atol):
    return np.abs(x - y) < atol

Now comes the hard part. We need to determine whether any two values are close along the innermost dimension. For this I reshape the arrays so that the first array has its values along the second dimension and is broadcast across the first, while the second array has its values along the first dimension and is broadcast across the second (note the 1, 3 and 3, 1):

In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]: 
array([[[ True,  True],
        [False, False],
        [False, False]],

       [[False, False],
        [False, False],
        [False, False]],

       [[False, False],
        [ True,  True],
        [False, False]]], dtype=bool)

Now we want all entries where the value is close to any other value (along the same dimension):

In [93]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0)
Out[93]: 
array([[ True,  True],
       [ True,  True],
       [False, False]], dtype=bool)

Then we want only those where both values of the tuple are close:

In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True,  True, False], dtype=bool)

And finally, we can use this to index array1:

In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]: 
array([[[ 1.22,  5.64]],

       [[ 2.31,  7.63]]])

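If you want to, you can swap the any and all calls. One might be faster than the other in your case. Note that the two orders are not equivalent in general: .all(axis=-1).any(axis=0) requires both coordinates to be close to the same point, whereas the order shown above also accepts a point whose x is close to one point and whose y is close to a different one.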
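
To see what the order of the any and all calls changes, here is a minimal sketch. It assumes the arrays have been reshaped to (n, 2), and close_rows is a hypothetical helper name, not part of the original answer. The small test case is made up so that x matches one point and y matches another.

```python
import numpy as np

def close_rows(a, b, atol):
    """Rows of `a` (shape (n, 2)) that have a row of `b` (shape (m, 2))
    within `atol` in every coordinate. Needs O(n * m) memory."""
    # close[i, j, k] is True when |a[j, k] - b[i, k]| < atol
    close = np.abs(a[np.newaxis, :, :] - b[:, np.newaxis, :]) < atol
    # all(axis=-1): both coordinates match the same row of b
    # any(axis=0): at least one row of b matches
    return a[close.all(axis=-1).any(axis=0)]

# Made-up case: no single row of b is close to a[0] in both coordinates
a = np.array([[1.0, 5.0]])
b = np.array([[1.0, 9.0],   # x matches, y does not
              [7.0, 5.0]])  # y matches, x does not

close = np.abs(a[np.newaxis] - b[:, np.newaxis]) < 0.03
print(close.any(axis=0).all(axis=-1))  # [ True]  (false positive)
print(close.all(axis=-1).any(axis=0))  # [False]
print(close_rows(a, b, 0.03).shape)    # (0, 2)
```

On the three-point example from the question both orders give the same mask, which is why the difference is easy to miss.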

The 3 in the reshape calls needs to be replaced with the actual length of your data.

This algorithm has the same poor (quadratic) runtime as the other answer using itertools.product, but at least the actual looping is done implicitly by NumPy and is implemented in C. This is visible in the timings:

In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here the pares function is the code defined by @Ferran Parés in another answer, and the arrays are already reshaped as that answer expects.

And for larger arrays it becomes more obvious:

array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))

array1_pares = array1.reshape(1000, 2)
array2_pares = array2.reshape(1000, 2)

In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In the end this is limited by the available system memory. My machine (16GB RAM) can still handle arrays of length 20000, but that pushes memory usage to almost 100%. It also takes about 12s:

In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
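
One way to sidestep the memory limit, at the cost of a Python-level loop, is to process the second array in blocks so that only a small slice of the full comparison exists at any time. This is a sketch under assumptions, not part of the original answer: close_rows_chunked and its chunk parameter are hypothetical names, and the arrays are assumed to be reshaped to (n, 2). The total work is still quadratic.

```python
import numpy as np

def close_rows_chunked(a, b, atol, chunk=64):
    """Boolean mask over the rows of `a` (shape (n, 2)) that have a row of
    `b` (shape (m, 2)) within `atol` in every coordinate.

    Only a (chunk, n, 2) intermediate array is held in memory at once."""
    mask = np.zeros(len(a), dtype=bool)
    for start in range(0, len(b), chunk):
        block = b[start:start + chunk]                        # (c, 2)
        close = np.abs(a[np.newaxis] - block[:, np.newaxis]) < atol  # (c, n, 2)
        # Both coordinates must match the same row of the block
        mask |= close.all(axis=-1).any(axis=0)
    return mask

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
print(a[close_rows_chunked(a, b, 0.03, chunk=2)])
```

A smaller chunk lowers peak memory but adds loop overhead, so the value is worth tuning for the data at hand.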

There are many possible ways to define that tolerance. Since we are talking about XY coordinates, the tolerance most probably refers to Euclidean distance. So we can use SciPy's Cython-powered k-d tree (cKDTree) for quick nearest-neighbor lookup, which is efficient in both memory and runtime. The implementation would look something like this:

from scipy.spatial import cKDTree

# Assuming a default tolerance value of 1 here
def intersect_close(a, b, tol=1):
    # Get closest distances for each pt in b
    dist = cKDTree(a).query(b, k=1)[0] # k=1 selects closest one neighbor

    # Check the distances against the given tolerance value and 
    # thus filter out rows off b for the final output
    return b[dist <= tol]

Sample step-by-step run:

# Input 2D arrays
In [68]: a
Out[68]: 
array([[1.22, 5.64],
       [2.31, 7.63],
       [4.94, 4.15]])

In [69]: b
Out[69]: 
array([[ 1.23,  5.63],
       [ 6.31, 10.63],
       [ 2.32,  7.65]])

# Get closest distances for each pt in b
In [70]: dist = cKDTree(a).query(b, k=1)[0]

In [71]: dist
Out[71]: array([0.01414214, 5.        , 0.02236068])

# Mask of distances within the given tolerance
In [72]: tol = 1

In [73]: dist <= tol
Out[73]: array([ True, False,  True])

# Finally filter out valid ones off b
In [74]: b[dist <= tol]
Out[74]: 
array([[1.23, 5.63],
       [2.32, 7.65]])
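
If the matching rows of both arrays are needed, or if the tolerance should apply per coordinate rather than as a Euclidean distance, the query call can be adjusted. The following is a sketch under assumptions: intersect_close_pairs is a hypothetical name, and it relies on the p and distance_upper_bound parameters of cKDTree.query (p=np.inf selects the maximum-coordinate distance; points with no neighbor within the bound are reported with an infinite distance).

```python
import numpy as np
from scipy.spatial import cKDTree

def intersect_close_pairs(a, b, tol, p=2):
    """For each row of `b` with a neighbor in `a` within `tol`,
    return (matched rows of b, their nearest rows of a).

    p=2 uses Euclidean distance; p=np.inf compares per coordinate."""
    dist, idx = cKDTree(a).query(b, k=1, p=p, distance_upper_bound=tol)
    found = np.isfinite(dist)  # misses are reported as inf
    return b[found], a[idx[found]]

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])

matched_b, matched_a = intersect_close_pairs(a, b, tol=0.05, p=np.inf)
print(matched_b)
print(matched_a)
```

Setting an upper bound also lets the tree stop searching early, which may help when the tolerance is small compared with the spread of the data.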

Timings on 200,000 points:

In [20]: N = 200000
    ...: np.random.seed(0)
    ...: a = np.random.rand(N,2)
    ...: b = np.random.rand(N,2)

In [21]: %timeit intersect_close(a, b)
1 loop, best of 3: 1.37 s per loop

As commented, scaling and rounding your numbers might allow you to use intersect1d or the equivalent.

And if you have just 2 columns, it might work to turn it into a 1d array of complex dtype.

But you might also want to keep in mind what intersect1d does:

if not assume_unique:
    # Might be faster than unique( intersect1d( ar1, ar2 ) )?
    ar1 = unique(ar1)
    ar2 = unique(ar2)
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]

unique has been enhanced to handle rows (the axis parameter), but intersect1d has not. In any case it uses argsort to put similar elements next to each other, and then skips the duplicates.

Notice that intersect1d concatenates the unique arrays, sorts them, and again finds the duplicates.

I know you didn't want a loop version, but to help conceptualize the problem, here's one anyway:

In [581]: a = np.array([(1.22, 5.64),
     ...:    (2.31, 7.63),
     ...:    (4.94, 4.15)])
     ...: 
     ...: b = np.array([(1.23, 5.63),
     ...:    (6.31, 10.63),
     ...:    (2.32, 7.65)])
     ...:    

I removed a layer of nesting in your arrays.

In [582]: c = []
In [583]: for a1 in a:
     ...:     for b1 in b:
     ...:         if np.allclose(a1,b1, atol=0.5): c.append((a1,b1))

or as a list comprehension:

In [586]: [(a1,b1) for a1 in a for b1 in b if np.allclose(a1,b1,atol=0.5)]
Out[586]: 
[(array([1.22, 5.64]), array([1.23, 5.63])),
 (array([2.31, 7.63]), array([2.32, 7.65]))]

Complex approximation

In [604]: aa = (a*10).astype(int)
In [605]: aa
Out[605]: 
array([[12, 56],
       [23, 76],
       [49, 41]])
In [606]: ac=aa[:,0]+1j*aa[:,1]
In [607]: bb = (b*10).astype(int)
In [608]: bc=bb[:,0]+1j*bb[:,1]
In [609]: np.intersect1d(ac,bc)
Out[609]: array([12.+56.j, 23.+76.j])
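
One caveat with this scale-and-truncate idea is that it sorts points into fixed bins, so two points closer than the bin width can still land in different bins. A minimal sketch with made-up values follows; to_complex_bins is a hypothetical helper that wraps the steps above.

```python
import numpy as np

def to_complex_bins(xy, scale=10):
    """Scale, truncate to int, and pack each (x, y) row into one complex number."""
    binned = (xy * scale).astype(int)
    return binned[:, 0] + 1j * binned[:, 1]

# Made-up points: x differs by only 0.02, but 1.19 -> bin 11 and 1.21 -> bin 12
p = np.array([[1.19, 5.64]])
q = np.array([[1.21, 5.64]])

common = np.intersect1d(to_complex_bins(p), to_complex_bins(q))
print(len(common))  # 0: the near-duplicate is missed
```

Conversely, two points up to one bin width apart (for example x = 1.22 and x = 1.29) fall into the same bin, so the effective tolerance depends on where the points sit relative to the bin edges.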

intersect-inspired

Concatenate the arrays, sort them, take the differences between consecutive rows, and find the small differences:

In [616]: ab = np.concatenate((a,b),axis=0)
In [618]: np.lexsort(ab.T)
Out[618]: array([2, 3, 0, 1, 5, 4], dtype=int32)
In [619]: ab1 = ab[_,:]
In [620]: ab1
Out[620]: 
array([[ 4.94,  4.15],
       [ 1.23,  5.63],
       [ 1.22,  5.64],
       [ 2.31,  7.63],
       [ 2.32,  7.65],
       [ 6.31, 10.63]])
In [621]: ab1[1:]-ab1[:-1]
Out[621]: 
array([[-3.71,  1.48],
       [-0.01,  0.01],
       [ 1.09,  1.99],
       [ 0.01,  0.02],
       [ 3.99,  2.98]])

In [623]: ((ab1[1:]-ab1[:-1])<.1).all(axis=1)  # refine with abs
Out[623]: array([False,  True, False,  True, False])
In [626]: np.where(Out[623])
Out[626]: (array([1, 3], dtype=int32),)
In [627]: ab1[_]
Out[627]: 
array([[1.23, 5.63],
       [2.31, 7.63]])

Maybe you could try this, using pure NumPy and a self-defined function:

import numpy as np
#Your Example
xDA=np.array([[1.22, 5.64],[2.31, 7.63],[4.94, 4.15],[6.1,6.2]])
yDA=np.array([[1.23, 5.63],[6.31, 10.63],[2.32, 7.65],[3.1,9.2]])
###Try this large sample###
#xDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
#yDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)

print(xDA)
print(yDA)

#Match x to y
def np_matrix(myx,myy,calp=0.2):
    Xxx = np.transpose(np.repeat(myx[:, np.newaxis], myy.size, axis=1))
    Yyy = np.repeat(myy[:, np.newaxis], myx.size, axis=1)

    # define a caliper
    matches = {}
    dist = np.abs(Xxx - Yyy)
    for m in range(0, myx.size):
        if (np.min(dist[:, m]) <= calp) or not calp:
            matches[m] = np.argmin(dist[:, m])
    return matches


alwd_dist=0.1

xc1=xDA[:,1]
yc1=yDA[:,1]
m1=np_matrix(xc1,yc1,alwd_dist)
xc0=xDA[:,0]
yc0=yDA[:,0]
m0=np_matrix(xc0,yc0,alwd_dist)

shared_items = set(m1.items()) & set(m0.items())
if (int(len(shared_items))==0):
    print("No Matched Items based on given allowed distance:",alwd_dist)
else:
    print("Matched:")
    for ke in shared_items:
        print(xDA[ke[0]],yDA[ke[1]])
