extract the N closest pairs from a numpy distance array
I have a large, symmetric, 2D distance array. I want to get the closest N pairs of observations.
The array is stored as a numpy condensed array, and contains on the order of 100 million observations.
Here's an example that gets the 100 closest distances on a smaller array (~500k observations), but it's a lot slower than I would like:
import numpy as np
import random
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # np.triu_indices needs an int
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)
It seems to me like there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.
NB: if lots of pairs are equidistant, that's OK; I don't care about their ordering in that case.
You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:
import numpy as np
import random
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]

def condensed_to_square_index(ti, c):
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))
ti = np.triu_indices(n, 1)  # computed once, outside the loop
for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)
You can also vectorize the creation of r:
r = list(zip(ti[0][closest] + 1, ti[1][closest] + 1))
or
r = np.vstack(ti)[:, closest] + 1
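As a quick sanity check (a self-contained sketch with illustrative variable names and a small input), the question's loop and both vectorized forms produce the same pairs:

```python
import numpy as np
import scipy.spatial.distance

# small reproducible input; names mirror the answer above
rng = np.random.RandomState(0)
c = rng.randint(1, 1000, size=50)[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
closest = dists.argsort()[:10]

n = int(np.ceil(np.sqrt(2 * len(dists))))
ti = np.triu_indices(n, 1)

# loop version from the question
loop_pairs = [(ti[0][i] + 1, ti[1][i] + 1) for i in closest]

# the two vectorized forms
zip_pairs = list(zip(ti[0][closest] + 1, ti[1][closest] + 1))
stack_pairs = np.vstack(ti)[:, closest] + 1   # shape (2, 10)

assert np.array_equal(np.array(loop_pairs), np.array(zip_pairs))
assert np.array_equal(np.array(loop_pairs).T, stack_pairs)
```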
If you are using numpy 1.8, you can speed up the location of the minimum values very notably using np.partition:
def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))

dists = np.random.rand(1000 * 999 // 2)  # a pdist array
In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True
In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop
In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop
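The speedup comes from np.partition/np.argpartition doing an O(len(a)) selection rather than a full O(len(a) log len(a)) sort: they only guarantee that the n smallest values land in the first n slots, in no particular order, which is why argsmallest_n re-sorts just those n candidates. A toy illustration of the semantics:

```python
import numpy as np

a = np.array([7, 2, 9, 4, 1, 8, 3])

# np.partition puts the 3 smallest values in the first 3 slots, unordered
part = np.partition(a, 3)
print(np.sort(part[:3]))   # [1 2 3]

# argsmallest_n-style: take the candidate indices, then sort only those 3
idx = np.argpartition(a, 3)[:3]
order = idx[np.argsort(a[idx])]
print(order)               # [4 1 6] -- positions of 1, 2, 3 in a
```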
And once you have the smallest indices, you don't need a loop to extract the pairs; do it in a single shot:
closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
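One caveat not raised in the original answer: np.triu_indices(n, 1) materializes two index arrays of length n*(n-1)/2, which becomes expensive at the 100-million-distance scale in the question. A closed-form condensed-to-square conversion avoids those arrays entirely; the condensed_to_pair name below is illustrative, and it returns 0-based indices (add 1 to match the pairs above):

```python
import numpy as np

def condensed_to_pair(k, n):
    # closed-form inverse of the condensed (pdist) index for n observations;
    # no O(n**2)-sized index arrays are built
    k = np.asarray(k)
    i = (n - 2 - np.floor(np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5)).astype(int)
    j = (k + i + 1 - n * (n - 1) // 2 + (n - i) * (n - i - 1) // 2).astype(int)
    return i, j

# sanity check against np.triu_indices on a small n
n = 10
ti = np.triu_indices(n, 1)
ks = np.arange(n * (n - 1) // 2)
i, j = condensed_to_pair(ks, n)
assert np.array_equal(i, ti[0]) and np.array_equal(j, ti[1])
```

For very large n the float sqrt could in principle lose precision, so it's worth validating against np.triu_indices at the sizes you care about before relying on it.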
The best solution probably won't generate all of the distances.
Proposal:
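The proposal text itself is missing from this copy of the answer. Purely as an illustrative sketch of one way to avoid generating all n*(n-1)/2 distances (not necessarily what the author intended): scipy.spatial.cKDTree supports Minkowski p=1 (cityblock) queries, so each point's nearest neighbour can be found without the full pdist array:

```python
import numpy as np
from scipy.spatial import cKDTree

# illustrative 1-D integer data, as in the question
rng = np.random.RandomState(0)
pts = rng.randint(1, 1000, size=500)[:, None].astype(float)

tree = cKDTree(pts)
# k=2: each point's nearest hit is itself (distance 0), so take column 1;
# p=1 selects the Manhattan / cityblock metric
dd, ii = tree.query(pts, k=2, p=1)
cand = np.column_stack((np.arange(len(pts)), ii[:, 1]))  # candidate pairs
cand_d = dd[:, 1]

# the N tightest candidate pairs
N = 100
order = np.argsort(cand_d)[:N]
```

The single globally closest pair is guaranteed to appear among these candidates (each of its endpoints picks the other as nearest neighbour), so the smallest candidate distance equals the smallest pdist entry; recovering the exact N closest pairs in general needs a larger k per point, or tree.query_pairs with a radius cutoff.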