
Finding intersection of two matrices in Python within a tolerance?

I'm looking for the most efficient way to find the intersection of two matrices of different sizes. Each matrix has three variables (columns) and a varying number of observations (rows). For example, matrices a and b:

import numpy as np

a = np.matrix('1 5 1003; 2 4 1002; 4 3 1008; 8 1 2005')
b = np.matrix('7 9 1006; 4 4 1007; 7 7 1050; 8 2 2003; 9 9 3000; 7 7 1000')

If I set the tolerance for each column as col1 = 1, col2 = 2, and col3 = 10, I want a function that outputs the indices of the rows in a and b that are within their respective tolerances of each other, for example:

[x1, x2] = func(a, b, col1, col2, col3)
print(x1)
>> [2 3]
print(x2)
>> [1 3]

You can see from the indices that row 2 of a is within the tolerances of row 1 of b (counting from 0), and likewise row 3 of a and row 3 of b.

I'm thinking I could loop through each row of matrix a and check whether it's within the tolerances of each row in b. But that seems inefficient for very large data sets.

Any suggestions for alternatives to a looping method for accomplishing this?
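For reference, the looping approach described in the question could be sketched roughly as follows. The name match_rows_loop is hypothetical, and the strict `<` comparison is an assumption about what "within tolerance" means (use `<=` if the boundary should count as a match). It is only a baseline to compare vectorized alternatives against, not a recommended implementation.

```python
import numpy as np

def match_rows_loop(a, b, tol):
    """Brute-force baseline: compare every row of a with every row of b."""
    a = np.asarray(a)  # also accepts np.matrix input
    b = np.asarray(b)
    x1, x2 = [], []
    for i, row_a in enumerate(a):
        for j, row_b in enumerate(b):
            # A pair matches when every column differs by less than its tolerance.
            if all(abs(row_a[k] - row_b[k]) < tol[k] for k in range(a.shape[1])):
                x1.append(i)
                x2.append(j)
    return np.array(x1, dtype=int), np.array(x2, dtype=int)

a = np.array([[1, 5, 1003], [2, 4, 1002], [4, 3, 1008], [8, 1, 2005]])
b = np.array([[7, 9, 1006], [4, 4, 1007], [7, 7, 1050],
              [8, 2, 2003], [9, 9, 3000], [7, 7, 1000]])
x1, x2 = match_rows_loop(a, b, tol=[1, 2, 10])
print(x1, x2)  # [2 3] [1 3]
```

The work grows with the product of the two row counts, all of it in the Python interpreter, which is why it becomes slow for large inputs.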

If you don't mind working with NumPy arrays, you could exploit broadcasting for a vectorized solution. Here's the implementation -

# Set tolerance values for each column
tol = [1, 2, 10]

# Get absolute differences between a and b keeping their columns aligned
diffs = np.abs(np.asarray(a[:,None]) - np.asarray(b))

# Compare each row with the triplet from `tol`.
# Get mask of all matching rows and finally get the matching indices
x1,x2 = np.nonzero((diffs < tol).all(2))
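If it helps, the same broadcasting idea can be wrapped in a function. This is a sketch rather than the answer's exact code: the name match_rows_broadcast is hypothetical, and it converts the inputs to plain arrays before adding the extra axis instead of indexing the np.matrix objects directly.

```python
import numpy as np

def match_rows_broadcast(a, b, tol):
    """Return index arrays (x1, x2) of row pairs within per-column tolerances."""
    a = np.asarray(a)      # shape (na, ncols); also accepts np.matrix
    b = np.asarray(b)      # shape (nb, ncols)
    tol = np.asarray(tol)  # shape (ncols,)
    # (na, 1, ncols) - (nb, ncols) broadcasts to (na, nb, ncols)
    diffs = np.abs(a[:, None, :] - b)
    # A pair matches only if every column is strictly below its tolerance.
    mask = (diffs < tol).all(axis=2)
    return np.nonzero(mask)

a = np.array([[1, 5, 1003], [2, 4, 1002], [4, 3, 1008], [8, 1, 2005]])
b = np.array([[7, 9, 1006], [4, 4, 1007], [7, 7, 1050],
              [8, 2, 2003], [9, 9, 3000], [7, 7, 1000]])
x1, x2 = match_rows_broadcast(a, b, [1, 2, 10])
print(x1, x2)  # [2 3] [1 3]
```

Note that the comparison is strict, so a difference exactly equal to the tolerance does not count as a match. Switch to `<=` if the boundary should be included.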

Sample run -

In [46]: # Inputs
    ...: a=np.matrix('1 5 1003; 2 4 1002; 4 3 1008; 8 1 2005')
    ...: b=np.matrix('7 9 1006; 4 4 1007; 7 7 1050; 8 2 2003; 9 9 3000; 7 7 1000')
    ...: 

In [47]: # Set tolerance values for each column
    ...: tol = [1, 2, 10]
    ...: 
    ...: # Get absolute differences between a and b keeping their columns aligned
    ...: diffs = np.abs(np.asarray(a[:,None]) - np.asarray(b))
    ...: 
    ...: # Compare each row with the triplet from `tol`.
    ...: # Get mask of all matching rows and finally get the matching indices
    ...: x1,x2 = np.nonzero((diffs < tol).all(2))
    ...: 

In [48]: x1,x2
Out[48]: (array([2, 3]), array([1, 3]))

Large data sizes case: The broadcasted version builds an intermediate array with one entry per row pair per column. If you are working with data sizes large enough for that to cause memory issues, then since you already know that the number of columns is small (3), you might want a minimal loop of 3 iterations that only ever holds two-dimensional arrays and so avoids the huge memory footprint, like so -

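na = a.shape[0]
nb = b.shape[0]

# Start with every (row of a, row of b) pair marked as a match
accum = np.ones((na,nb),dtype=bool)

# For np.matrix inputs, a[:,i] is (na,1) and b[:,i].ravel() is (1,nb),
# so each difference broadcasts to (na,nb); AND the columns together
for i in range(a.shape[1]):
    accum &= np.abs(a[:,i] - b[:,i].ravel()) < tol[i]
x1,x2 = np.nonzero(accum)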
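The loop above depends on np.matrix column slices staying two-dimensional. With plain NumPy arrays the column slices are one-dimensional, so the extra axis has to be added explicitly. A minimal sketch under that assumption follows (the function name match_rows_columnwise and the random test data are hypothetical). It also checks that the result agrees with the full three-dimensional broadcast.

```python
import numpy as np

def match_rows_columnwise(a, b, tol):
    """Column-by-column variant that only ever holds (na, nb) arrays."""
    a = np.asarray(a)
    b = np.asarray(b)
    accum = np.ones((a.shape[0], b.shape[0]), dtype=bool)
    for i in range(a.shape[1]):
        # (na, 1) - (nb,) broadcasts to (na, nb)
        accum &= np.abs(a[:, i, None] - b[:, i]) < tol[i]
    return np.nonzero(accum)

# Hypothetical random data, used only to compare the two approaches.
rng = np.random.default_rng(0)
a = rng.integers(0, 20, size=(50, 3))
b = rng.integers(0, 20, size=(80, 3))
tol = [1, 2, 10]

x1, x2 = match_rows_columnwise(a, b, tol)

# Reference result from the full (na, nb, ncols) broadcast.
full = (np.abs(a[:, None, :] - b) < np.asarray(tol)).all(axis=2)
y1, y2 = np.nonzero(full)
print(np.array_equal(x1, y1) and np.array_equal(x2, y2))
```

The boolean mask is still one entry per row pair, so for extremely large inputs it may be worth processing a in blocks of rows as well.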
