Fastest way to find the nearest pairs between two numpy arrays without duplicates

Given two large numpy arrays A and B with different numbers of rows (len(B) > len(A)) but the same number of columns (A.shape[1] == B.shape[1] == 3), I want to know the fastest way to get a subset C of B that has the minimum total distance (the sum of all pairwise distances) to A without duplicates (every row of A and every row of B appears in at most one pair). This means C should have the same shape as A.

Below is my code, but there are two main issues:

  1. I cannot tell whether this gives the minimum total distance.
  2. In reality I have a much more expensive distance function than np.linalg.norm (it needs to take care of periodic boundary conditions). I think this is definitely not the fastest approach, since the code below calls the distance function once per pair. The per-call overhead of the more expensive function is significant, and the code effectively runs forever. Any suggestions?

import numpy as np
from operator import itemgetter
import random
import time

A = 100.*np.random.rand(1000, 3)
B = A.copy()
for (i,j), _ in np.ndenumerate(B):
    B[i,j] += np.random.rand()
B = np.vstack([B, 100.*np.random.rand(500, 3)])

def calc_dist(x, y):
    return np.linalg.norm(x - y)

t0 = time.time()
taken = []
for rowi in A:
    res = min(((k, calc_dist(rowi, rowj)) for k, rowj in enumerate(B)
                if k not in taken), key=itemgetter(1))
    taken.append(res[0])

C = B[taken]

print(A.shape, B.shape, C.shape)
# (1000, 3) (1500, 3) (1000, 3)

print(time.time() - t0)
# 12.406389951705933

Edit: for those interested in the expensive distance function, it uses the ase package (which can be installed with pip install ase):

from ase.geometry import find_mic
def calc_mic_dist(x, y):
    return find_mic(np.array([x]) - np.array([y]), 
                    cell=np.array([[50., 0.0, 0.0], 
                                   [25., 45., 0.0], 
                                   [0.0, 0.0, 100.]]))[1][0]

If you're OK with calculating all len(A) × len(B) distances, which isn't that expensive for the sizes you've given, scipy.optimize has a function, linear_sum_assignment, that solves this directly.

import scipy.optimize
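# cost[i, j] is the distance between A[i] and B[j]; shape (len(A), len(B))
cost = np.linalg.norm(A[:, np.newaxis, :] - B, axis=2)
# Pair each row of A with a distinct row of B so that the summed cost is minimal
_, indexes = scipy.optimize.linear_sum_assignment(cost)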
C = B[indexes]
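
As a small sanity check of the approach above, the following sketch compares a greedy row-by-row match with linear_sum_assignment on a tiny hand-made data set. The data values and the helper name greedy_match are illustrative assumptions, not part of the original post; the point is only that a greedy match can give a larger total distance than the assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Tiny made-up example (assumed data) in which greedy matching is suboptimal.
A = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
B = np.array([[0.6, 0.0, 0.0],
              [3.5, 0.0, 0.0],
              [-1.0, 0.0, 0.0]])

# Full (len(A), len(B)) distance matrix via broadcasting.
cost = np.linalg.norm(A[:, np.newaxis, :] - B, axis=2)

def greedy_match(cost):
    """Hypothetical helper: each row takes its nearest still-unused column."""
    free = np.ones(cost.shape[1], dtype=bool)
    picked = []
    for row in cost:
        # Mask out columns that are already taken before taking the minimum.
        k = int(np.argmin(np.where(free, row, np.inf)))
        free[k] = False
        picked.append(k)
    return np.array(picked)

greedy_idx = greedy_match(cost)
rows, opt_idx = linear_sum_assignment(cost)

greedy_total = cost[np.arange(len(A)), greedy_idx].sum()
opt_total = cost[rows, opt_idx].sum()
print("greedy :", greedy_idx, greedy_total)
print("optimal:", opt_idx, opt_total)
```

Here the greedy pass gives A[0] its nearest row first, which forces A[1] onto a distant row, while the assignment solver accepts a slightly worse match for A[0] to lower the total.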

Using the power of numpy broadcasting and vectorization

The find_mic function in ase.geometry can handle 2D numpy arrays:

from ase.geometry import find_mic
def calc_mic_dist(x, y):
    return find_mic(x - y, 
                    cell=np.array([[50., 0.0, 0.0], 
                                   [25., 45., 0.0], 
                                   [0.0, 0.0, 100.]]))[1]

Test:

x = np.random.randn(1,3)
y = np.random.randn(5,3)

print(calc_mic_dist(x, y).shape)
# It is a distance metric, so it should be symmetric:
assert np.allclose(calc_mic_dist(x, y), calc_mic_dist(y, x))

Output:

(5,)

As you can see, the distance is calculated between x and every row of y, because x - y in numpy does the magic of broadcasting.
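
To make the broadcasting step concrete without needing ase, here is a minimal sketch that uses the plain Euclidean norm as a stand-in for find_mic. The function name euclid_dist and the sample values are assumptions for illustration; it does not handle periodic boundaries.

```python
import numpy as np

def euclid_dist(x, y):
    # Stand-in for calc_mic_dist (assumption): Euclidean norm of each row of x - y.
    return np.linalg.norm(x - y, axis=-1)

x = np.array([[0.0, 0.0, 0.0]])          # shape (1, 3)
y = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 2.0],
              [1.0, 0.0, 0.0]])          # shape (3, 3)

diff = x - y            # the single row of x is broadcast against every row of y
d = euclid_dist(x, y)   # one distance per row of y
print(diff.shape, d.shape, d)
```

The single function call replaces a Python loop over the rows of y, which is where the speed-up over the pair-by-pair version comes from.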

Solution:

def calc_mic_dist(x, y):
    return find_mic(x - y, 
                    cell=np.array([[50., 0.0, 0.0], 
                                   [25., 45., 0.0], 
                                   [0.0, 0.0, 100.]]))[1]

t0 = time.time()
A = 100.*np.random.rand(1000, 3)
B = 100.*np.random.rand(5000, 3)
selected = [np.argmin(calc_mic_dist(a, B)) for a in A]
C = B[selected]
print (A.shape, B.shape, C.shape)

print (f"Time: {time.time()-t0}")

Output:

(1000, 3) (5000, 3) (1000, 3)
Time: 9.817562341690063

This takes around 10 seconds on Google Colab.

Testing:

We know that calc_mic_dist(x, x) == 0, so if A is a subset of B, then C should be exactly A.

A = 100.*np.random.rand(1000, 3)
B = np.vstack([100.*np.random.rand(500, 3), A, 100.*np.random.rand(500, 3)])
selected = [np.argmin(calc_mic_dist(a, B)) for a in A]
C = B[selected]
print (A.shape, B.shape, C.shape)
print (np.allclose(A,C))

Output:

(1000, 3) (2000, 3) (1000, 3)
True

Edit 1: Avoid duplicates

Once a vector in B is selected, it cannot be selected again for other rows of A.

This can be achieved by removing the selected vector from B as soon as it is selected, so that it is no longer a candidate for the remaining rows of A.

A = 100.*np.random.rand(1000, 3)
B = np.vstack([100.*np.random.rand(500, 3), A, 100.*np.random.rand(500, 3)])

B_ = B.copy()
C = np.zeros_like(A)

for i, a in enumerate(A):
  s = np.argmin(calc_mic_dist(a, B_))
  C[i] = B_[s]
  # Remove the paired row from the remaining candidates
  B_ = np.delete(B_, (s), axis=0)

print (A.shape, B.shape, C.shape)
print (np.allclose(A,C))

Output:

(1000, 3) (2000, 3) (1000, 3)
True
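
The loop above copies B_ on every np.delete call and loses the original row indices of B. One possible variation, sketched here with a plain Euclidean dist_fn as a stand-in for calc_mic_dist (an assumption, as is the helper name greedy_unique), keeps a boolean mask of free rows instead. Like the loop above, it is greedy, so it avoids duplicates but is not guaranteed to minimize the total distance.

```python
import numpy as np

def dist_fn(a, B):
    # Stand-in for calc_mic_dist (assumption): distance from a to every row of B.
    return np.linalg.norm(a - B, axis=1)

def greedy_unique(A, B, dist_fn):
    """Hypothetical helper: greedy nearest match using each row of B at most once."""
    free = np.ones(len(B), dtype=bool)     # True while a row of B is still available
    taken = np.empty(len(A), dtype=int)    # index into B chosen for each row of A
    for i, a in enumerate(A):
        d = dist_fn(a, B)
        d[~free] = np.inf                  # rows already paired can no longer win
        s = int(np.argmin(d))
        free[s] = False
        taken[i] = s
    return taken

rng = np.random.default_rng(0)
A = 100.0 * rng.random((200, 3))
B = np.vstack([100.0 * rng.random((100, 3)), A, 100.0 * rng.random((100, 3))])

taken = greedy_unique(A, B, dist_fn)
C = B[taken]
print(A.shape, B.shape, C.shape)
print(len(set(taken.tolist())) == len(A), np.allclose(A, C))
```

Because taken holds indices into the original B, the same pairs can be reused later, for example to look up other per-row data.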
