
Optimise Euclidean distance matrix algorithm if only interested in closest points

The following Euclidean distance algorithm creates an MxM matrix of distances between the rows of an MxN input matrix (representing points in some N-dimensional space). The runtime of this algorithm scales as O(m^2). Can this be improved upon if I am only interested in the rows (i.e. points) that are closest to each other? (My downstream task consists of performing k-NN, amongst other things.)

import numpy as np


# 100 points in a 20-dimensional space
vectors = np.random.randn(100, 20)
m = vectors.shape[0]

# Row i holds the distances from point i to every other point
distances = np.zeros([m, m])
for i in range(m):
    vec = vectors[i]
    distances[i] = [np.linalg.norm(vec - vectors[j]) for j in range(m)]

I would suggest leveraging SciPy's condensed distance matrix instead of the for-loop of pairwise comparisons. In particular,

from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(vectors))

provides a ~85x speedup! The documentation for pdist and squareform can be found in the SciPy reference.

Fundamentally, the complexity remains quadratic (you still need to compare every element of vectors with every other one). However, the implementation exploits symmetry and the fact that the distance of every element to itself is 0, so it only computes the upper-triangular entries and then mirrors them across the diagonal to obtain the full square distance matrix.
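A minimal sketch illustrating the condensed form, reusing the random vectors from the question (the assertions simply confirm the shape, symmetry, and zero diagonal):

import numpy as np
from scipy.spatial.distance import pdist, squareform

vectors = np.random.randn(100, 20)
m = vectors.shape[0]

condensed = pdist(vectors)                 # only the m*(m-1)/2 upper-triangular entries
assert condensed.shape[0] == m * (m - 1) // 2

full = squareform(condensed)               # mirrored into the full m x m matrix
assert np.allclose(full, full.T)           # symmetric
assert np.all(np.diag(full) == 0)          # zero diagonal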

Your code ran in 71 ms, while the SciPy version ran in 0.83 ms. A detailed performance comparison can be found in this thread.
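Exact timings depend on your machine, but a sketch along these lines (wrapping both approaches in functions so timeit can call them) reproduces the comparison:

import timeit

import numpy as np
from scipy.spatial.distance import pdist, squareform

vectors = np.random.randn(100, 20)
m = vectors.shape[0]

def loop_version():
    # The original double loop over all pairs
    distances = np.zeros([m, m])
    for i in range(m):
        vec = vectors[i]
        distances[i] = [np.linalg.norm(vec - vectors[j]) for j in range(m)]
    return distances

def scipy_version():
    return squareform(pdist(vectors))

print(timeit.timeit(loop_version, number=10) / 10)   # seconds per run
print(timeit.timeit(scipy_version, number=10) / 10)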

Regardless, if you plan to run k-NN you might want to consider scikit-learn, where you can simply provide the vectors as X, as shown below.
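For example, a minimal sketch using scikit-learn's NearestNeighbors (the choice of n_neighbors=5 is illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

vectors = np.random.randn(100, 20)

nn = NearestNeighbors(n_neighbors=5, metric='euclidean')
nn.fit(vectors)

# Distances and indices of each point's 5 nearest neighbours; when querying
# the training set itself, each point appears as its own neighbour at distance 0.
dist, idx = nn.kneighbors(vectors)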
