Extract the N closest pairs from a numpy distance array
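I have a large, symmetric, 2D distance array. I want to get the closest N pairs of observations.
The array is stored as a numpy condensed array and has on the order of 100 million observations.
Here is an example that gets the 100 closest distances on a smaller array (~500k observations), but it is a lot slower than I would like.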
import numpy as np
import random
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c] + 1, ti[1][c] + 1

r = []
# number of observations, recovered from the condensed length n*(n-1)/2;
# cast to int because np.triu_indices needs an integer size
n = int(np.ceil(np.sqrt(2 * len(dists))))
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)
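It seems to me there must be a quicker way to do this with standard numpy or scipy functions, but I'm stumped.
Note: if many pairs are equidistant, that's fine; in that case I don't care about their order.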
You don't need to compute ti on every call to condensed_to_square_index. Here is a basic modification that computes it only once:
import numpy as np
import random
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# look up a pair in the precomputed upper-triangle indices
def condensed_to_square_index(ti, c):
    return ti[0][c] + 1, ti[1][c] + 1

r = []
# cast to int because np.triu_indices needs an integer size
n = int(np.ceil(np.sqrt(2 * len(dists))))
ti = np.triu_indices(n, 1)  # computed once, outside the loop
for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)
You can also vectorize the creation of r:
r = list(zip(ti[0][closest] + 1, ti[1][closest] + 1))
or
r = np.vstack(ti)[:, closest] + 1
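For very large inputs, np.triu_indices(n, 1) itself allocates two index arrays as long as dists. One possible alternative is to compute each pair directly from its condensed index with a closed-form formula. The sketch below is an assumption and not part of the answer above: the helper name condensed_to_pair is made up for illustration, it returns 0-based indices (add 1 to match the convention used in this thread), and it relies on floating-point sqrt, so it may need an integer correction step for very large n.

```python
import numpy as np

def condensed_to_pair(k, n):
    """Map condensed index/indices k to 0-based (i, j) with i < j.

    Hypothetical helper: n is the number of observations, and k indexes
    a condensed array of length n*(n-1)//2 in row-major upper-triangle order.
    """
    k = np.asarray(k, dtype=np.int64)
    # Row index: invert the cumulative row lengths (n-1) + (n-2) + ...
    i = n - 2 - np.floor(
        np.sqrt(-8.0 * k + 4.0 * n * (n - 1) - 7.0) / 2.0 - 0.5
    ).astype(np.int64)
    # Column index: offset of k inside row i
    j = k + i + 1 - n * (n - 1) // 2 + (n - i) * (n - i - 1) // 2
    return i, j

# Usage: only the N selected indices are converted, nothing of size len(dists)
i, j = condensed_to_pair([0, 3, 5], 4)
```

Only the selected condensed indices are converted, so the extra memory is proportional to N and not to the number of distances.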
If you use np.partition, available from numpy 1.8, you can get a very noticeable speed-up when locating the smallest values:
def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))

dists = np.random.rand(1000*999//2) # a pdist array
In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True
In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop
In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop
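One edge case to keep in mind: np.argpartition requires the kth argument to be smaller than the array length, so argsmallest_n(a, n) raises an error when n equals or exceeds len(a). A minimal sketch of a guarded variant follows; the name argsmallest_n_safe is hypothetical, and falling back to a full argsort for small arrays is just one reasonable choice.

```python
import numpy as np

def argsmallest_n_safe(a, n):
    """Indices of the n smallest values of a, in ascending order of value.

    Hypothetical guarded variant of argsmallest_n: falls back to a full
    argsort when n is not smaller than the array size.
    """
    a = np.asarray(a)
    if n >= a.size:
        return np.argsort(a)
    # Partition so the n smallest values occupy the first n slots (unordered)
    part = np.argpartition(a, n)[:n]
    # Sort only those n candidates
    return part[np.argsort(a[part])]

rng = np.random.default_rng(0)
values = rng.random(1000)
idx = argsmallest_n_safe(values, 100)
```

The partition step is linear in the array size on average, and only the n selected candidates are sorted afterwards, which is where the speed-up over a full argsort comes from.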
Once you have the indices of the smallest values, you don't need a loop to extract the pairs; do it in a single shot:
closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
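As a quick sanity check of the whole pipeline, the sketch below rebuilds the condensed distances with plain numpy and confirms that the extracted pairs really reproduce the smallest distances. It assumes 1-D data and the cityblock metric, as in the question; the variable names x, rows and cols are illustrative, and the indices are kept 0-based so they can be used to index x directly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 1000, size=200)   # 1-D observations, as in the question
n = len(x)
N = 10

# Upper-triangle index pairs, in the same row-major order pdist uses
rows, cols = np.triu_indices(n, 1)
# For 1-D data the cityblock distance is just the absolute difference
dists = np.abs(x[rows] - x[cols])

# Partial selection of the N smallest distances, then sort only those
part = np.argpartition(dists, N)[:N]
closest = part[np.argsort(dists[part])]

# 0-based (i, j) pairs; add 1 for the 1-based labels used in this thread
pairs = np.column_stack((rows[closest], cols[closest]))

# Distances recomputed from the raw data for the extracted pairs
check = np.abs(x[pairs[:, 0]] - x[pairs[:, 1]])
```

If many pairs are tied, the particular pairs returned may differ from those of a full argsort, but the distances themselves should match.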
The best solution probably won't generate all of the distances.
Proposal: