Intersection between two multi-dimensional arrays with tolerance - NumPy / Python
I am stuck on a problem. I have two 2-D numpy arrays filled with x and y coordinates. Those arrays might look like:
array1([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]],
array2([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]],
Now I have to find "duplicate nodes". However, I also have to treat nodes as equal when their coordinates agree within a given tolerance, so I can't use solutions like this. Since my arrays are quite big (~200,000 lines each), two simple for loops are not an option either. My final output should look like this:
output([[(1.23, 5.63)],
[(2.32, 7.65)]],
I would appreciate some hints.
Cheers,
To compare two nodes within a given tolerance I recommend numpy.isclose(), where you can set a relative and an absolute tolerance.
numpy.isclose(1.24, 1.25, atol=1e-1)
# True
numpy.isclose([1.24, 2.31], [1.25, 2.32], atol=1e-1)
# array([ True,  True])
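For reference, numpy.isclose(a, b) tests |a - b| <= atol + rtol * |b|, so the default relative term stays active unless rtol is set to 0. Here is a small sketch of the difference; the sample values are made up for illustration only:

```python
import numpy as np

# Illustrative values, not taken from the question.
a = np.array([1.24, 100.0])
b = np.array([1.25, 100.5])

# Absolute tolerance only: each difference must be <= 0.1.
abs_only = np.isclose(a, b, rtol=0, atol=1e-1)

# Relative tolerance only: each difference must be <= 1% of |b|.
rel_only = np.isclose(a, b, rtol=1e-2, atol=0)

print(abs_only, rel_only)
```

For coordinates with a fixed physical tolerance, passing rtol=0 is probably what you want, so that large coordinate values are not treated more leniently.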
Instead of writing two for loops, you can use itertools.product() to go through all pairs. The following code does what you want:
import itertools
import numpy as np

array1 = np.array([[1.22, 5.64],
                   [2.31, 7.63],
                   [4.94, 4.15]])
array2 = np.array([[1.23, 5.63],
                   [6.31, 10.63],
                   [2.32, 7.64]])
output = np.empty((0,2))
# Visit every (row of array1, row of array2) index pair
for i0, i1 in itertools.product(np.arange(array1.shape[0]),
                                np.arange(array2.shape[0])):
    # Keep the row of array2 when both coordinates are within tolerance
    if np.all(np.isclose(array1[i0], array2[i1], atol=1e-1)):
        output = np.concatenate((output, [array2[i1]]), axis=0)
# output = [[ 1.23 5.63]
#           [ 2.32 7.64]]
Defining an isclose function similar to numpy.isclose, but a bit faster (mostly because it does not validate its input and supports only an absolute tolerance):
import numpy as np
array1 = np.array([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]])
array2 = np.array([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]])
def isclose(x, y, atol):
    return np.abs(x - y) < atol
Now comes the hard part. We need to calculate whether any two values are close within the innermost dimension. For this I reshape the arrays so that the first array has its values along the second dimension, replicated across the first, and the second array has its values along the first dimension, replicated along the second (note the 1, 3 and 3, 1):
In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]:
array([[[ True, True],
[False, False],
[False, False]],
[[False, False],
[False, False],
[False, False]],
[[False, False],
[ True, True],
[False, False]]], dtype=bool)
Now we want all entries where the value is close to any other value (along the same dimension):
In [93]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0)
Out[93]:
array([[ True, True],
[ True, True],
[False, False]], dtype=bool)
Then we want only those where both values of the tuple are close:
In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True, True, False], dtype=bool)
And finally, we can use this to index array1:
In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]:
array([[[ 1.22, 5.64]],
[[ 2.31, 7.63]]])
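If you want to, you can swap the any and all calls. One might be faster than the other in your case. Note that the order can also change the result: with all(axis=-1) applied first, both coordinates must be close to the same row of array2, whereas with any(axis=0) first, x and y may each match a different row.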
The 3 in the reshape calls needs to be replaced with the actual length of your data.
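To avoid hard-coding the length, the steps above could be wrapped in a small helper. This is only a sketch: the name close_rows is hypothetical, and it applies all(axis=-1) before any(axis=0), so that a row only counts as a match when both of its coordinates are close to the same row of the other array.

```python
import numpy as np

def close_rows(array1, array2, atol):
    """Rows of array1 that have a row of array2 within atol in every coordinate."""
    a = np.asarray(array1).reshape(-1, 2)   # shape (n, 2)
    b = np.asarray(array2).reshape(-1, 2)   # shape (m, 2)
    # Broadcast to shape (m, n, 2): every row of b against every row of a.
    close = np.abs(a[np.newaxis, :, :] - b[:, np.newaxis, :]) < atol
    # all(axis=-1): both coordinates are close to the same row of b.
    # any(axis=0): that holds for at least one row of b.
    mask = close.all(axis=-1).any(axis=0)
    return a[mask]

array1 = np.array([[(1.22, 5.64)], [(2.31, 7.63)], [(4.94, 4.15)]])
array2 = np.array([[(1.23, 5.63)], [(6.31, 10.63)], [(2.32, 7.65)]])
result = close_rows(array1, array2, 0.03)
print(result)

# A point whose x matches one row and whose y matches another is not a match.
p = np.array([[1.0, 1.0]])
q = np.array([[1.0, 9.0], [9.0, 1.0]])
mixed = close_rows(p, q, 0.03)
print(len(mixed))
```

The memory use is unchanged from the snippets above: the intermediate boolean array still has n * m * 2 entries.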
This algorithm has the same poor (quadratic) runtime as the other answer using itertools.product, but at least the actual looping is done implicitly by numpy and is implemented in C. This is visible in the timings:
In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here the pares function is the code defined by @Ferran Parés in another answer, and the arrays are already reshaped as they are there.
And for larger arrays it becomes more obvious:
array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array1_pares = array1.reshape(1000, 2)
array2_pares = array2.reshape(1000, 2)
In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In the end this is limited by the available system memory. My machine (16GB RAM) can still handle arrays of length 20000, but that pushes memory usage to almost 100%. It also takes about 12s:
In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
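If memory is the bottleneck, one possible workaround is to broadcast against one block of the second array at a time, so that the temporary array stays bounded. The following is a sketch under that assumption; the function name close_rows_chunked and the chunk parameter are hypothetical, and the total amount of work remains quadratic.

```python
import numpy as np

def close_rows_chunked(array1, array2, atol, chunk=1000):
    """Like the full broadcast, but compares against `chunk` rows of array2 at a time."""
    a = np.asarray(array1).reshape(-1, 2)
    b = np.asarray(array2).reshape(-1, 2)
    mask = np.zeros(len(a), dtype=bool)
    for start in range(0, len(b), chunk):
        block = b[start:start + chunk]                      # at most `chunk` rows
        # Temporary array has shape (len(block), len(a), 2) only.
        close = np.abs(a[np.newaxis, :, :] - block[:, np.newaxis, :]) < atol
        mask |= close.all(axis=-1).any(axis=0)
    return a[mask]

rng = np.random.default_rng(0)
a = rng.normal(0, 0.1, size=(500, 1, 2))
b = rng.normal(0, 0.1, size=(500, 1, 2))
chunked = close_rows_chunked(a, b, 0.03, chunk=128)
print(chunked.shape)
```

A smaller chunk lowers peak memory at the cost of more Python-level iterations, so the best value likely depends on your machine.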
There are many possible ways to define that tolerance. Since we are talking about XY coordinates, most probably we want Euclidean distances to set that tolerance value. So we can use the Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient both in memory and in performance. The implementation would look something like this -
from scipy.spatial import cKDTree
# Assuming a default tolerance value of 1 here
def intersect_close(a, b, tol=1):
    # Get closest distances for each pt in b
    dist = cKDTree(a).query(b, k=1)[0] # k=1 selects closest one neighbor
    # Check the distances against the given tolerance value and
    # thus filter out rows off b for the final output
    return b[dist <= tol]
Sample step-by-step run -
# Input 2D arrays
In [68]: a
Out[68]:
array([[1.22, 5.64],
[2.31, 7.63],
[4.94, 4.15]])
In [69]: b
Out[69]:
array([[ 1.23, 5.63],
[ 6.31, 10.63],
[ 2.32, 7.65]])
# Get closest distances for each pt in b
In [70]: dist = cKDTree(a).query(b, k=1)[0]
In [71]: dist
Out[71]: array([0.01414214, 5. , 0.02236068])
# Mask of distances within the given tolerance
In [72]: tol = 1
In [73]: dist <= tol
Out[73]: array([ True, False, True])
# Finally filter out valid ones off b
In [74]: b[dist <= tol]
Out[74]:
array([[1.23, 5.63],
[2.32, 7.65]])
Timings on 200,000 pts -
In [20]: N = 200000
...: np.random.seed(0)
...: a = np.random.rand(N,2)
...: b = np.random.rand(N,2)
In [21]: %timeit intersect_close(a, b)
1 loop, best of 3: 1.37 s per loop
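If the tolerance should apply to each coordinate separately rather than to the Euclidean distance, the same kd-tree lookup can be run with the Chebyshev (maximum-coordinate) metric by passing p=np.inf to query. A minimal sketch, assuming that interpretation of the tolerance; the function name intersect_close_per_axis is hypothetical:

```python
import numpy as np
from scipy.spatial import cKDTree

def intersect_close_per_axis(a, b, tol):
    # p=np.inf selects the Chebyshev distance: max(|dx|, |dy|).
    dist, _ = cKDTree(a).query(b, k=1, p=np.inf)
    # Keep rows of b whose nearest row of a is within tol on every axis.
    return b[dist <= tol]

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
out = intersect_close_per_axis(a, b, tol=0.03)
print(out)
```

With the sample data this keeps the first and third rows of b, the same rows the Euclidean version selects with tol=1.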
As commented, scaling and rounding your numbers might allow you to use intersect1d or the equivalent.
And if you have just 2 columns, it might work to turn it into a 1d array of complex dtype.
But you might also want to keep in mind what intersect1d does:
if not assume_unique:
    # Might be faster than unique( intersect1d( ar1, ar2 ) )?
    ar1 = unique(ar1)
    ar2 = unique(ar2)
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]
unique has been enhanced to handle rows (the axis parameter), but intersect has not. In any case it uses argsort to put similar elements next to each other, and then skips the duplicates.
Notice that intersect concatenates the unique arrays, sorts, and again finds the duplicates.
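The same unique / concatenate / find-duplicates idea can be applied to rows using np.unique with axis=0. This is only a sketch: the helper intersect_rows is hypothetical (it is not a NumPy function), and it assumes the coordinates have already been scaled and truncated to integers so that exact row equality is meaningful.

```python
import numpy as np

def intersect_rows(ar1, ar2):
    """Rows present in both 2-D integer arrays, mirroring the intersect1d steps."""
    u1 = np.unique(ar1, axis=0)              # unique rows of each input
    u2 = np.unique(ar2, axis=0)
    stacked = np.concatenate((u1, u2), axis=0)
    # After de-duplicating each input, a row seen twice must come from both.
    rows, counts = np.unique(stacked, axis=0, return_counts=True)
    return rows[counts > 1]

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
aa = (a * 10).astype(int)   # truncate to a grid of width 0.1
bb = (b * 10).astype(int)
common = intersect_rows(aa, bb)
print(common)
```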
I know you didn't want a loop version, but to help conceptualize the problem, here's one anyway:
In [581]: a = np.array([(1.22, 5.64),
...: (2.31, 7.63),
...: (4.94, 4.15)])
...:
...: b = np.array([(1.23, 5.63),
...: (6.31, 10.63),
...: (2.32, 7.65)])
...:
I removed a layer of nesting in your arrays.
In [582]: c = []
In [583]: for a1 in a:
     ...:     for b1 in b:
     ...:         if np.allclose(a1,b1, atol=0.5): c.append((a1,b1))
or as a list comprehension
In [586]: [(a1,b1) for a1 in a for b1 in b if np.allclose(a1,b1,atol=0.5)]
Out[586]:
[(array([1.22, 5.64]), array([1.23, 5.63])),
(array([2.31, 7.63]), array([2.32, 7.65]))]
In [604]: aa = (a*10).astype(int)
In [605]: aa
Out[605]:
array([[12, 56],
[23, 76],
[49, 41]])
In [606]: ac=aa[:,0]+1j*aa[:,1]
In [607]: bb = (b*10).astype(int)
In [608]: bc=bb[:,0]+1j*bb[:,1]
In [609]: np.intersect1d(ac,bc)
Out[609]: array([12.+56.j, 23.+76.j])
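One caveat with scaling and truncating is that it compares grid cells, not distances: two values that are very close can land in different cells, and two values in the same cell can be almost a full cell width apart. A small illustration with made-up values and an assumed tolerance of 0.05:

```python
import numpy as np

# Illustrative values, not taken from the question.
x = np.array([1.19, 1.22])
y = np.array([1.21, 1.29])

cells_x = (x * 10).astype(int)   # grid cell index, cell width 0.1
cells_y = (y * 10).astype(int)

same_cell = cells_x == cells_y           # what the scaled intersect compares
within_tol = np.abs(x - y) <= 0.05       # what a tolerance check compares

print(same_cell, within_tol)
```

The first pair differs by only 0.02 but falls into different cells, while the second pair shares a cell despite differing by 0.07, so the scaled intersect is best treated as an approximation.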
Concatenate the arrays, sort them, take the differences, and find the small ones:
In [616]: ab = np.concatenate((a,b),axis=0)
In [618]: np.lexsort(ab.T)
Out[618]: array([2, 3, 0, 1, 5, 4], dtype=int32)
In [619]: ab1 = ab[_,:]
In [620]: ab1
Out[620]:
array([[ 4.94, 4.15],
[ 1.23, 5.63],
[ 1.22, 5.64],
[ 2.31, 7.63],
[ 2.32, 7.65],
[ 6.31, 10.63]])
In [621]: ab1[1:]-ab1[:-1]
Out[621]:
array([[-3.71, 1.48],
[-0.01, 0.01],
[ 1.09, 1.99],
[ 0.01, 0.02],
[ 3.99, 2.98]])
In [623]: ((ab1[1:]-ab1[:-1])<.1).all(axis=1) # refine with abs
Out[623]: array([False, True, False, True, False])
In [626]: np.where(Out[623])
Out[626]: (array([1, 3], dtype=int32),)
In [627]: ab[_]
Out[627]:
array([[2.31, 7.63],
[1.23, 5.63]])
Maybe you could try this using pure NumPy and a self-defined function:
import numpy as np
#Your Example
xDA=np.array([[1.22, 5.64],[2.31, 7.63],[4.94, 4.15],[6.1,6.2]])
yDA=np.array([[1.23, 5.63],[6.31, 10.63],[2.32, 7.65],[3.1,9.2]])
###Try this large sample###
#xDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
#yDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
print(xDA)
print(yDA)
#Match x to y
def np_matrix(myx,myy,calp=0.2):
    Xxx = np.transpose(np.repeat(myx[:, np.newaxis], myy.size, axis=1))
    Yyy = np.repeat(myy[:, np.newaxis], myx.size, axis=1)
    # define a caliper
    matches = {}
    dist = np.abs(Xxx - Yyy)
    for m in range(0, myx.size):
        if (np.min(dist[:, m]) <= calp) or not calp:
            matches[m] = np.argmin(dist[:, m])
    return matches
alwd_dist=0.1
xc1=xDA[:,1]
yc1=yDA[:,1]
m1=np_matrix(xc1,yc1,alwd_dist)
xc0=xDA[:,0]
yc0=yDA[:,0]
m0=np_matrix(xc0,yc0,alwd_dist)
shared_items = set(m1.items()) & set(m0.items())
if (int(len(shared_items))==0):
    print("No Matched Items based on given allowed distance:",alwd_dist)
else:
    print("Matched:")
    for ke in shared_items:
        print(xDA[ke[0]],yDA[ke[1]])