Given two large numpy arrays A and B with different numbers of rows (len(B) > len(A)) but the same number of columns (A.shape[1] = B.shape[1] = 3), I want to know the fastest way to get a subset C from B that has the minimum total distance (sum of all pair-wise distances) to A without duplicates (each pair must be unique). This means C should have the same shape as A.
Below is my code, but it has issues, mainly that the real distance-calculating function I need is more expensive than np.linalg.norm (it needs to take care of periodic boundary conditions). I think this is definitely not the fastest way to go, since the code below calls the distance-calculating function one pair at a time. There is a significant overhead when I call the more expensive distance-calculating function, and it will run forever. Any suggestions?
import numpy as np
from operator import itemgetter
import random
import time
A = 100.*np.random.rand(1000, 3)
B = A.copy()
for (i,j), _ in np.ndenumerate(B):
    B[i,j] += np.random.rand()
B = np.vstack([B, 100.*np.random.rand(500, 3)])
def calc_dist(x, y):
    return np.linalg.norm(x - y)
t0 = time.time()
taken = []
for rowi in A:
    res = min(((k, calc_dist(rowi, rowj)) for k, rowj in enumerate(B)
               if k not in taken), key=itemgetter(1))
    taken.append(res[0])
C = B[taken]
print(A.shape, B.shape, C.shape)
>>> (1000, 3) (1500, 3) (1000, 3)
print(time.time() - t0)
>>> 12.406389951705933
Edit: for those who are interested in the expensive distance-calculating function, it uses the ase package (can be installed with pip install ase).
from ase.geometry import find_mic
def calc_mic_dist(x, y):
    return find_mic(np.array([x]) - np.array([y]),
                    cell=np.array([[50., 0.0, 0.0],
                                   [25., 45., 0.0],
                                   [0.0, 0.0, 100.]]))[1][0]
If you're OK with calculating all N² distances, which isn't that expensive for the sizes you've given, scipy.optimize has a function that will solve this directly.
import scipy.optimize
cost = np.linalg.norm(A[:, np.newaxis, :] - B, axis=2)
_, indexes = scipy.optimize.linear_sum_assignment(cost)
C = B[indexes]
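Since the question needs a distance other than np.linalg.norm, here is a minimal sketch of how the same assignment could be fed by any row-wise distance function. The names build_cost_matrix, row_dist and euclidean_rows are hypothetical helpers introduced only for illustration; euclidean_rows is a stand-in metric and does not handle periodic boundary conditions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def build_cost_matrix(A, B, row_dist):
    """Stack one row of distances per row of A.

    row_dist(a, B) is a hypothetical callable that returns the distances
    from a single point a to every row of B as a 1D array.
    """
    return np.stack([row_dist(a, B) for a in A])


def euclidean_rows(a, B):
    # Stand-in metric for illustration only (no periodic boundaries).
    return np.linalg.norm(B - a, axis=1)


rng = np.random.default_rng(0)
A = 100.0 * rng.random((50, 3))
B = np.vstack([A + rng.random((50, 3)), 100.0 * rng.random((25, 3))])

cost = build_cost_matrix(A, B, euclidean_rows)  # shape (len(A), len(B))
rows, cols = linear_sum_assignment(cost)        # one distinct column per row
C = B[cols]                                     # C[i] is matched to A[i]
total = cost[rows, cols].sum()                  # minimised total distance
```

Any function with the same call shape, such as a vectorized minimum-image distance, could be passed in place of euclidean_rows. The cost matrix is rectangular here, which linear_sum_assignment accepts; each row of A gets a different row of B.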
Using the power of numpy broadcasting and vectorization
The find_mic function in ase.geometry can handle 2D numpy arrays.
from ase.geometry import find_mic
def calc_mic_dist(x, y):
    return find_mic(x - y,
                    cell=np.array([[50., 0.0, 0.0],
                                   [25., 45., 0.0],
                                   [0.0, 0.0, 100.]]))[1]
Test:
x = np.random.randn(1,3)
y = np.random.randn(5,3)
print (calc_mic_dist(x,y).shape)
# A distance metric is symmetric, so:
assert np.allclose(calc_mic_dist(x,y), calc_mic_dist(y,x))
Output:
(5,)
As you can see, the metric is calculated between the single row of x and every row of y, because x - y in numpy is evaluated with broadcasting.
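To make the shapes concrete, here is a small sketch of the broadcasting involved. It uses plain Euclidean norms rather than find_mic, and the array values are arbitrary illustration data.

```python
import numpy as np

x = np.arange(3.0).reshape(1, 3)      # one point, shape (1, 3)
y = np.arange(15.0).reshape(5, 3)     # five points, shape (5, 3)

diff = x - y                          # x is stretched along axis 0 -> (5, 3)
dist = np.linalg.norm(diff, axis=1)   # one distance per row of y -> (5,)

# Subtracting arrays of shape (4, 3) and (5, 3) directly would raise an
# error; all-pairs differences need an extra axis.
a = np.zeros((4, 3))
pairwise = a[:, np.newaxis, :] - y    # shape (4, 5, 3)
```

This is why the approach here loops over the rows of A and broadcasts each single row against the whole of B.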
def calc_mic_dist(x, y):
    return find_mic(x - y,
                    cell=np.array([[50., 0.0, 0.0],
                                   [25., 45., 0.0],
                                   [0.0, 0.0, 100.]]))[1]
t0 = time.time()
A = 100.*np.random.rand(1000, 3)
B = 100.*np.random.rand(5000, 3)
selected = [np.argmin(calc_mic_dist(a, B)) for a in A]
C = B[selected]
print (A.shape, B.shape, C.shape)
print (f"Time: {time.time()-t0}")
Output:
(1000, 3) (5000, 3) (1000, 3)
Time: 9.817562341690063
This takes around 10 seconds on Google Colab.
We know that calc_mic_dist(x, x) == 0, so if A is a subset of B then C should be exactly A.
A = 100.*np.random.rand(1000, 3)
B = np.vstack([100.*np.random.rand(500, 3), A, 100.*np.random.rand(500, 3)])
selected = [np.argmin(calc_mic_dist(a, B)) for a in A]
C = B[selected]
print (A.shape, B.shape, C.shape)
print (np.allclose(A,C))
Output:
(1000, 3) (2000, 3) (1000, 3)
True
Once a vector in B is selected, it cannot be selected again for other rows of A. This can be achieved by removing the selected vector from B once it is selected, so that it does not appear again as a possible candidate for the following rows of A.
A = 100.*np.random.rand(1000, 3)
B = np.vstack([100.*np.random.rand(500, 3), A, 100.*np.random.rand(500, 3)])
B_ = B.copy()
C = np.zeros_like(A)
for i, a in enumerate(A):
    s = np.argmin(calc_mic_dist(a, B_))
    C[i] = B_[s]
    # Remove the paired row so it cannot be selected again
    B_ = np.delete(B_, s, axis=0)
print (A.shape, B.shape, C.shape)
print (np.allclose(A,C))
Output:
(1000, 3) (2000, 3) (1000, 3)
True
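As a possible variation, the repeated np.delete (which copies the remaining rows on every iteration) could be replaced by a boolean mask that hides rows already taken. The sketch below assumes a hypothetical row_dist(a, B) callable and uses a Euclidean stand-in (euclidean_rows) instead of calc_mic_dist so that it runs without ase. Note that a greedy pass like this depends on the order of the rows of A and is not guaranteed to give the minimum total distance.

```python
import numpy as np


def greedy_unique_match(A, B, row_dist):
    """Greedy nearest-neighbour matching that never reuses a row of B."""
    available = np.ones(len(B), dtype=bool)
    selected = np.empty(len(A), dtype=int)
    for i, a in enumerate(A):
        d = row_dist(a, B)
        d = np.where(available, d, np.inf)  # hide rows already taken
        s = int(np.argmin(d))
        selected[i] = s
        available[s] = False
    return selected


def euclidean_rows(a, B):
    # Stand-in metric for illustration only (no periodic boundaries).
    return np.linalg.norm(B - a, axis=1)


rng = np.random.default_rng(0)
A = 100.0 * rng.random((200, 3))
B = np.vstack([100.0 * rng.random((100, 3)), A, 100.0 * rng.random((100, 3))])

selected = greedy_unique_match(A, B, euclidean_rows)
C = B[selected]
```

The trade-off is that distances to already-taken rows are still computed each time, in exchange for avoiding the array copies and keeping indices into the original B.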