Extract the N closest pairs from a numpy distance array

I have a large, symmetric, 2D distance array. I want to get the N closest pairs of observations.

The array is stored as a numpy condensed array and holds on the order of 100 million distances.

Here's an example that gets the 100 closest distances on a smaller array (~500k distances), but it's a lot slower than I would like.

import numpy as np
import random
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the 
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # number of original observations; triu_indices needs an int
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)

It seems to me that there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.

NB: If lots of pairs are equidistant, that's OK, and I don't care about their ordering in that case.

You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:

import numpy as np
import random
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# ti is computed once below and passed in, instead of being rebuilt on every call
def condensed_to_square_index(ti, c):
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # number of original observations; triu_indices needs an int
ti = np.triu_indices(n, 1)

for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)

You can also vectorize the creation of r:

r = list(zip(ti[0][closest] + 1, ti[1][closest] + 1))

or

r = np.vstack(ti)[:, closest] + 1

You can speed up finding the smallest values considerably with np.partition, which is available from numpy 1.8 onward:

def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))

dists = np.random.rand(1000*999//2) # a pdist array

In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True

In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop

In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop
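
The equality check above compares indices, which works because random floats are almost never tied. When many distances are equal (as the question allows), the two methods can legitimately return different indices, so comparing the selected values is the safer check. A minimal sketch, with an illustrative integer array standing in for a pdist result:

```python
import numpy as np

def argsmallest_n(a, n):
    # indices of the n smallest values of a, ordered by value
    part = np.argpartition(a, n)[:n]
    return part[np.argsort(a[part])]

rng = np.random.default_rng(0)
tied = rng.integers(0, 50, size=10_000)  # illustrative data with many repeated values

fast = argsmallest_n(tied, 100)
slow = np.argsort(tied)[:100]

# The indices may differ where values are tied, but the selected values agree.
same_values = np.array_equal(tied[fast], tied[slow])
```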

Once you have the indices of the smallest distances, you don't need a loop to extract the pairs; do it in a single shot:

closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
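
For the full-size problem, np.triu_indices builds two index arrays as long as the condensed array itself, which may be more memory than you want to spend just to look up 100 entries. One possible alternative is to invert the condensed-index formula directly. The sketch below assumes the pdist row-major upper-triangle layout and a known number of points n; the helper name condensed_to_pairs is made up for illustration, and it returns 0-based indices.

```python
import numpy as np

def condensed_to_pairs(k, n):
    """Map condensed pdist indices k to 0-based (i, j) pairs with i < j.

    Hypothetical helper: inverts k = n*i - i*(i+1)/2 + (j - i - 1).
    """
    k = np.asarray(k, dtype=np.int64)
    # row index, from solving the quadratic for the start of each row
    i = (n - 2 - np.floor(np.sqrt(-8.0 * k + 4.0 * n * (n - 1) - 7) / 2.0 - 0.5)).astype(np.int64)
    # column index, from the offset of k within row i
    j = k + i + 1 - n * (n - 1) // 2 + (n - i) * (n - i - 1) // 2
    return i, j

# Cross-check against np.triu_indices on a small case.
n_points = 50
all_k = np.arange(n_points * (n_points - 1) // 2)
i_idx, j_idx = condensed_to_pairs(all_k, n_points)
expected_i, expected_j = np.triu_indices(n_points, 1)
```

With this, something like condensed_to_pairs(closest, 1000) would replace the triu_indices lookup; add 1 to both results if you want the 1-based numbering used above.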

The best solution probably won't generate all of the distances.

Proposal:

  1. Make a heap of max size 100 (if it grows bigger, reduce it).
  2. Use the Closest Pair algorithm to find the closest pair.
  3. Add the pair to the heap (priority queue).
  4. Choose one point of that pair. Add its 99 closest neighbors to the heap.
  5. Remove the chosen point from the list.
  6. Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you ran the Closest Pair algorithm.
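
The sketch below is not the heap procedure above. It swaps in a different technique with the same goal of never building the full distance array: a k-nearest-neighbour query on scipy's cKDTree. It relies on the observation that any pair among the N closest must appear in the N nearest neighbours of one of its endpoints. The function name closest_pairs_kdtree is made up for illustration, the cityblock metric (p=1) is carried over from the question, and the approach assumes N is small compared with the number of points.

```python
import numpy as np
from scipy.spatial import cKDTree

def closest_pairs_kdtree(points, n_pairs):
    """Return 0-based (pairs, dists) for the n_pairs closest pairs (cityblock)."""
    points = np.asarray(points, dtype=float)
    m = len(points)
    k = min(n_pairs + 1, m)  # +1 because a point is usually its own nearest neighbour
    tree = cKDTree(points)
    d, j = tree.query(points, k=k, p=1)  # p=1 selects the cityblock metric
    i = np.repeat(np.arange(m, dtype=np.int64), k)
    j = j.ravel().astype(np.int64)
    d = d.ravel()
    keep = i != j  # drop self matches
    lo = np.minimum(i, j)[keep]
    hi = np.maximum(i, j)[keep]
    d = d[keep]
    # a pair can be found from both endpoints; keep one copy of each
    codes, first = np.unique(lo * m + hi, return_index=True)
    d = d[first]
    order = np.argsort(d)[:n_pairs]
    pairs = np.column_stack((codes[order] // m, codes[order] % m))
    return pairs, d[order]

rng = np.random.default_rng(1)
pts = rng.integers(1, 1000, size=(300, 1))  # illustrative 1-D data, as in the question
pairs, pair_dists = closest_pairs_kdtree(pts, 50)
```

The query returns an array of shape (number of points, N + 1), which is far smaller than the condensed array when N is around 100. With tied distances, the pairs returned may differ from those of the argsort approach, but the distances should match.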
