Extract the N closest pairs from a numpy distance array

Question

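I have a large, symmetric, 2D distance array. I want to get the closest N pairs of observations.

The array is stored as a numpy condensed array, and has on the order of 100 million observations.

Here's an example that gets the 100 closest distances on a smaller array (~500k observations), but it's a lot slower than I would like.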

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the 
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # number of observations; np.triu_indices expects an integer size
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)

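It seems to me there must be a quicker way to do this with standard numpy or scipy functions, but I'm stumped.

NB: If lots of pairs are equidistant, that's OK, and I don't care about their ordering in that case.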

Answer 1

You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # number of observations; np.triu_indices expects an integer size
ti = np.triu_indices(n, 1)

for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)

You can also vectorize the creation of r:

r = list(zip(ti[0][closest] + 1, ti[1][closest] + 1))  # list() so this also works on Python 3

or

r = np.vstack(ti)[:, closest] + 1
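
For the full-size problem, np.triu_indices itself builds two index arrays as long as the condensed array, which may be more memory than you want to spend. As a rough sketch (not part of the answer above), the pair for each condensed index can be computed in closed form instead. The helper name condensed_to_pairs is made up for illustration, it returns 0-based indices (add 1 to match the code above), and it relies on float64 square roots, so I'd check it against np.triu_indices for your n before relying on it.

```python
import numpy as np


def condensed_to_pairs(k, n):
    """Map 0-based condensed indices k to 0-based (i, j) pairs with i < j.

    n is the number of observations. Hypothetical helper for illustration.
    """
    k = np.asarray(k, dtype=np.int64)
    # Row index: invert the row-major layout of the strict upper triangle.
    root = np.sqrt(-8.0 * k + 4.0 * n * (n - 1) - 7.0)
    i = n - 2 - np.floor(root / 2.0 - 0.5).astype(np.int64)
    # Column index: offset of k from the start of row i in the condensed array.
    j = k + i + 1 - n * (n - 1) // 2 + (n - i) * (n - i - 1) // 2
    return i, j


# Usage: same result as indexing np.triu_indices, without building it.
n_obs = 1000
some_indices = np.array([0, 998, 999, 499499])
rows, cols = condensed_to_pairs(some_indices, n_obs)
```

Only the N selected indices are ever converted, so the extra memory is proportional to N rather than to the number of distances.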

Answer 2

If you are using numpy 1.8 or later, you can speed up locating the smallest values quite notably with np.partition:

def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))

dists = np.random.rand(1000*999//2) # a pdist array

In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True

In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop

In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop

Once you have the indices of the smallest values, you don't need a loop to extract the pairs; it can be done in a single shot:

closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
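
To convince yourself the recovered pairs are right, one option is to wrap the two steps in a function and compare against the full square matrix on a small input. This is only a sketch: closest_pairs is a name I made up, it returns 0-based indices (unlike the + 1 above), and the fallback to a plain argsort is there because np.argpartition rejects a kth equal to the array length.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def closest_pairs(dists, n_obs, n_pairs):
    """Return 0-based (i, j) pairs and distances for the n_pairs smallest entries."""
    n_pairs = min(n_pairs, len(dists))
    if n_pairs == len(dists):
        closest = np.argsort(dists)  # argpartition needs kth < len(dists)
    else:
        part = np.argpartition(dists, n_pairs)[:n_pairs]
        closest = part[np.argsort(dists[part])]  # order the N candidates only
    rows, cols = np.triu_indices(n_obs, 1)
    return np.column_stack((rows[closest], cols[closest])), dists[closest]


rng = np.random.RandomState(0)
points = rng.rand(200, 3)
dists = pdist(points, 'cityblock')
pairs, pair_dists = closest_pairs(dists, len(points), 10)

# Cross-check: the square-form entry for each pair is the selected distance.
square = squareform(dists)
assert np.allclose(square[pairs[:, 0], pairs[:, 1]], pair_dists)
```

The square matrix is only affordable for small tests like this one; on the real data you would skip that check.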

Answer 3

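The best solution probably won't generate all of the distances.

Proposal:

Make a heap with a maximum size of 100 (if it grows larger, shrink it).
Use the Closest Pair algorithm to find the closest pair.
Add the pair to the heap (priority queue).
Choose one point of that pair. Add its 99 closest neighbors to the heap.
Remove the chosen point from the list.
Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you have run the Closest Pair algorithm.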
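
The proposal above is only outlined, so here is a loose sketch in the same spirit rather than a literal implementation of it. It swaps the Closest Pair algorithm and the heap for k-nearest-neighbor queries on a scipy.spatial.cKDTree: the partner in any of the N closest pairs has to be among a point's N nearest neighbors, so only about N candidates per point are examined and the full distance array is never built. It assumes you still have the original points (not just the condensed distances), and closest_pairs_kdtree is a hypothetical name. Indices are 0-based.

```python
import numpy as np
from scipy.spatial import cKDTree


def closest_pairs_kdtree(points, n_pairs):
    """Return (pairs, dists) for the n_pairs closest pairs under cityblock distance.

    Hypothetical helper: works from the raw points, not a condensed array.
    """
    points = np.asarray(points, dtype=float)
    n_points = len(points)
    k = min(n_pairs + 1, n_points)  # +1 because each point also matches itself
    tree = cKDTree(points)
    dd, ii = tree.query(points, k=k, p=1)  # p=1 selects the cityblock metric

    rows = np.repeat(np.arange(n_points), k)
    cols = ii.ravel()
    dist = dd.ravel()

    keep = rows != cols  # drop self-matches
    lo = np.minimum(rows[keep], cols[keep])
    hi = np.maximum(rows[keep], cols[keep])
    dist = dist[keep]

    # A pair can be reported from both of its endpoints; keep one copy.
    pairs, first = np.unique(np.column_stack((lo, hi)), axis=0, return_index=True)
    dist = dist[first]

    order = np.argsort(dist)[:n_pairs]
    return pairs[order], dist[order]


# Usage with data shaped like the question's example.
rng = np.random.RandomState(0)
c = rng.randint(1, 1000, size=(1000, 1))
pairs, pair_dists = closest_pairs_kdtree(c, 100)
```

Whether this beats the argpartition approach depends on the dimensionality and on N; k-d trees tend to lose their edge in high dimensions, so it is worth timing on your own data.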

Answer 1
3 2013-12-12 10:59:32

Answer 2
2 accepted 2013-12-12 15:00:42

Answer 3
0 2013-12-12 10:53:38
