Finding the nearest point in an array - the inverse of a KDTree

Question

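I have a very large ndarray A and a sorted list of points k (a small list, about 30 points).

For each element of A, I want to determine the closest element in the list of points k, together with its index. So something like: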

>>> A = np.asarray([3, 4, 5, 6])
>>> k = np.asarray([4.1, 3])
>>> values, indices
[3, 4.1, 4.1, 4.1], [1, 0, 0, 0]

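Now, the problem is that A is very large. So I cannot do something inefficient such as adding a dimension to A, taking the absolute difference to k, and then taking the minimum over each column.

For now I have been using np.searchsorted, as shown in the second answer here: Find nearest value in numpy array, but even this is too slow. This is the code I used (modified to work with multiple values):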
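
For reference, the broadcasting approach ruled out above can be sketched as follows. The function name `nearest_bruteforce` is made up for illustration. The intermediate array holds len(A) * len(k) values, so for, say, 3 million elements and 30 points in float64 it needs roughly 720 MB, which is why it does not scale.

```python
import numpy as np

def nearest_bruteforce(A, k):
    """Return (values, indices) of the entry of k closest to each element of A."""
    A = np.asarray(A)
    k = np.asarray(k)
    # dist has shape (len(A), len(k)): one row per element of A
    dist = np.abs(A[:, None] - k[None, :])
    indices = dist.argmin(axis=1)
    return k[indices], indices

values, indices = nearest_bruteforce([3, 4, 5, 6], [4.1, 3])
print(values, indices)
```

It is still useful as a correctness reference for faster variants on small inputs.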

def find_nearest(A,k):

    indicesClosest = np.searchsorted(k, A)
    flagToReduce = indicesClosest==k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1
    flagToReduce = np.logical_or(flagToReduce,
                     np.abs(A-k[indicesClosest-1]) <
                     np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1
    valuesClosest = k[indicesClosest]
    return valuesClosest, indicesClosest

Then I thought of using scipy.spatial.KDTree:

>>> d = scipy.spatial.KDTree(k[:, None])  # KDTree expects 2-D data of shape (n_points, n_dims)
>>> d.query(A[:, None])

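This turned out to be much slower than the searchsorted solution.

On the other hand, the array A is always the same; only k changes. So it would be beneficial to build some auxiliary structure on A (like an "inverse KDTree") and then query it with the small array k.

Is there something like that?

Edit

At the moment I am using a variant of np.searchsorted that requires the array A to be sorted. We can do the sorting in advance as a preprocessing step, but we still have to restore the original order after computing the indices. This variant is about twice as fast as the one above.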

import numpy as np

A = np.random.random(3000000)
k = np.random.random(30)

indices_sort = np.argsort(A)
sortedA = A[indices_sort]

inv_indices_sort = np.argsort(indices_sort)
k.sort()


def find_nearest(sortedA, k):
    midpoints = k[:-1] + np.diff(k)/2
    idx_aux = np.searchsorted(sortedA, midpoints)
    idx = []
    count = 0
    final_indices = np.zeros(sortedA.shape, dtype=int)
    old_obj = None
    for obj in idx_aux:
        if obj != old_obj:
            idx.append((obj, count))
            old_obj = obj
        count += 1
    old_idx = 0
    for idx_A, idx_k in idx:
        final_indices[old_idx:idx_A] = idx_k
        old_idx = idx_A
    final_indices[old_idx:] = len(k)-1

    indicesClosest = final_indices[inv_indices_sort] #<- this takes 90% of the time
    return k[indicesClosest], indicesClosest

The line that takes so much time is the one that restores the indices to the original order.
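
As a side note on the permutation bookkeeping, here is a small sketch (not part of the original code) showing two equivalent ways to undo the sort. The inverse permutation can be built with one scatter instead of a second argsort, and the result can be scattered back through `indices_sort` directly instead of gathered through `inv_indices_sort`. The array `final_indices` below is only a stand-in for the real per-position result. Both forms access memory randomly, so whether the scatter is actually faster than the gather is something to measure rather than assume.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random(1000)

indices_sort = np.argsort(A)

# Inverse permutation via a second argsort, as in the snippet above: O(n log n)
inv_by_argsort = np.argsort(indices_sort)

# The same permutation built with a single scatter: O(n)
inv_by_scatter = np.empty_like(indices_sort)
inv_by_scatter[indices_sort] = np.arange(indices_sort.size)

# Stand-in for the result computed in sorted order (hypothetical values)
final_indices = np.arange(A.size) % 30

# Gather through the inverse permutation (what the snippet above does)
restored_gather = final_indices[inv_by_argsort]

# Scatter through the forward permutation (no inverse needed)
restored_scatter = np.empty_like(final_indices)
restored_scatter[indices_sort] = final_indices

print(np.array_equal(restored_gather, restored_scatter))
```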

Answer 1

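Update:

The built-in function numpy.digitize can actually do exactly what you need. Only a small trick is required: digitize assigns values to bins. We can turn k into bins by sorting the array and placing the bin edges exactly halfway between adjacent elements.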

import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3, 1])  # added another value to show that sorting/binning works

ki = np.argsort(k)
ks = k[ki]

i = np.digitize(A, (ks[:-1] + ks[1:]) / 2)

indices = ki[i]
values = ks[i]

print(values, indices)
# [ 3.   4.1  4.1  4.1] [1 0 0 0]
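
The same trick can be wrapped in a function and checked against a brute-force distance computation on random data. This is only a sketch; the name `nearest_digitize` and the test sizes are chosen for illustration. The check compares distances rather than indices so that exact ties cannot cause a spurious mismatch.

```python
import numpy as np

def nearest_digitize(A, k):
    """Nearest entry of k for each element of A; k may be unsorted."""
    k = np.asarray(k)
    order = np.argsort(k)                # positions of the sorted values in the original k
    ks = k[order]
    edges = (ks[:-1] + ks[1:]) / 2       # bin edges halfway between neighbours
    bins = np.digitize(A, edges)         # bin j corresponds to ks[j]
    return ks[bins], order[bins]

rng = np.random.default_rng(0)
A = rng.random(10_000)
k = rng.random(30)

values, indices = nearest_digitize(A, k)

# Brute-force reference: smallest distance from each element of A to any entry of k
best_dist = np.abs(A[:, None] - k[None, :]).min(axis=1)
print(np.allclose(np.abs(A - values), best_dist))
```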

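Old answer:

For each element of k, I would take a brute-force approach: perform one vectorized pass over A and update those positions where the current element improves the approximation.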

import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3])

err = np.zeros_like(A) + np.inf  # keep track of error over passes

values = np.empty_like(A, dtype=k.dtype)
indices = np.empty_like(A, dtype=int)

for i, v in enumerate(k):
    d = np.abs(A - v)
    mask = d < err  # only update where v is closer to A
    values[mask] = v
    indices[mask] = i
    err[mask] = d[mask]

print(values, indices)
# [ 3.   4.1  4.1  4.1] [1 0 0 0]

This method requires three temporary variables of the same size as A, so it will fail if not enough memory is available.
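
If memory is the limiting factor, one possible workaround is to process A in fixed-size chunks so that the temporaries stay bounded. The sketch below assumes a 1-D A; the function name `nearest_chunked` and the `chunk_size` parameter are hypothetical and not part of the answer above. Only the output index array is as large as A.

```python
import numpy as np

def nearest_chunked(A, k, chunk_size=1_000_000):
    """Brute-force nearest neighbour in k, processing 1-D A in chunks."""
    A = np.asarray(A)
    k = np.asarray(k)
    indices = np.empty(A.shape, dtype=np.intp)
    for start in range(0, A.size, chunk_size):
        block = A[start:start + chunk_size]
        # Temporary of shape (len(block), len(k)); its size is bounded by chunk_size
        dist = np.abs(block[:, None] - k[None, :])
        indices[start:start + chunk_size] = dist.argmin(axis=1)
    return k[indices], indices

# Small chunk size only to exercise the loop on the toy example
values, indices = nearest_chunked(np.asarray([3, 4, 5, 6]), np.asarray([4.1, 3]), chunk_size=3)
print(values, indices)
```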

Answer 2

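So, after some work and an idea from the scipy mailing list, I think that in my case (with a constant A and a slowly changing k), the best approach is the following implementation.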

class SearchSorted:
    def __init__(self, tensor, use_k_optimization=True):

        '''
        use_k_optimization requires storing 4x the size of the tensor.
        If use_k_optimization is True, the class will assume that successive calls will be made with similar k.
        When this happens, we can cut the running time significantly by storing additional variables. If it won't be
        called with successive k, set the flag to False, as otherwise would just consume more memory for no
        good reason
        '''

        self.indices_sort = np.argsort(tensor)
        self.sorted_tensor = tensor[self.indices_sort]
        self.inv_indices_sort = np.argsort(self.indices_sort)
        self.use_k_optimization = use_k_optimization

        self.previous_indices_results = None
        self.prev_idx_A_k_pair = None

    def query(self, k):
        midpoints = k[:-1] + np.diff(k) / 2
        idx_count = np.searchsorted(self.sorted_tensor, midpoints)
        idx_A_k_pair = []
        count = 0

        old_obj = 0
        for obj in idx_count:
            if obj != old_obj:
                idx_A_k_pair.append((obj, count))
                old_obj = obj
            count += 1

        if not self.use_k_optimization or self.previous_indices_results is None:
            #creates the index matrix in the sorted case
            final_indices = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
            #and now unsort it to match the original tensor position
            indicesClosest = final_indices[self.inv_indices_sort]
            if self.use_k_optimization:
                self.prev_idx_A_k_pair = idx_A_k_pair
                self.previous_indices_results = indicesClosest
            return indicesClosest

        old_indices_unsorted = self._create_indices_matrix(self.prev_idx_A_k_pair, self.sorted_tensor.shape, len(k))
        new_indices_unsorted = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
        mask = new_indices_unsorted != old_indices_unsorted

        self.prev_idx_A_k_pair = idx_A_k_pair
        self.previous_indices_results[self.indices_sort[mask]] = new_indices_unsorted[mask]
        indicesClosest = self.previous_indices_results

        return indicesClosest

    @staticmethod
    def _create_indices_matrix(idx_A_k_pair, matrix_shape, len_quant_points):
        old_idx = 0
        final_indices = np.zeros(matrix_shape, dtype=int)
        for idx_A, idx_k in idx_A_k_pair:
            final_indices[old_idx:idx_A] = idx_k
            old_idx = idx_A
        final_indices[old_idx:] = len_quant_points - 1
        return final_indices

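The idea is to sort the array A beforehand and then use searchsorted on the sorted A with the midpoints of k. This gives the same information as before, since it tells us exactly which points of A are closest to which point of k. The method _create_indices_matrix builds the full index array from this information, and then we unsort it to recover the original order of A. To take advantage of a slowly changing k, we save the last indices and determine which ones have to change; then we change only those. For a slowly changing k, this gives much better performance (at a higher memory cost, however).

For a random matrix A of 5 million elements and a k of about 30 elements, repeating the experiment 60 times, we get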
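
The first part of this idea (midpoints, searchsorted on the sorted A, then unsorting) can also be written without the Python loop by turning the searchsorted boundaries into per-point counts and expanding them with np.repeat. This is a sketch of an alternative to _create_indices_matrix, not the implementation above; the name `nearest_presorted` is made up, and the incremental update for a slowly changing k is not included.

```python
import numpy as np

def nearest_presorted(sorted_A, k_sorted):
    """Index into k_sorted of the nearest point, for each element of sorted_A."""
    midpoints = (k_sorted[:-1] + k_sorted[1:]) / 2
    # boundaries[i] = number of elements of sorted_A below midpoint i
    boundaries = np.searchsorted(sorted_A, midpoints)
    # counts[j] = number of elements of sorted_A whose nearest point is k_sorted[j]
    counts = np.diff(np.concatenate(([0], boundaries, [sorted_A.size])))
    return np.repeat(np.arange(k_sorted.size), counts)

rng = np.random.default_rng(0)
A = rng.random(100_000)
k = np.sort(rng.random(30))

order = np.argsort(A)                      # preprocessing, done once for a fixed A
indices_sorted = nearest_presorted(A[order], k)

# Undo the sort: position order[s] in the original A gets the result for sorted slot s
indices = np.empty_like(indices_sorted)
indices[order] = indices_sorted

best_dist = np.abs(A[:, None] - k[None, :]).min(axis=1)
print(np.allclose(np.abs(A - k[indices]), best_dist))
```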

Function search_sorted1; 15.72285795211792s
Function search_sorted2; 13.030786037445068s
Function query; 2.3306031227111816s <- the one with use_k_optimization = True
Function query; 4.81286096572876s   <- with use_k_optimization = False

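scipy.spatial.KDTree.query was so slow that I did not time it (more than 1 minute, though). This is the code used for the timing; it also contains the implementations of search_sorted1 and search_sorted2.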

import numpy as np
import scipy
import scipy.spatial
import time


A = np.random.rand(10000*500) #5 million elements
k = np.random.rand(32)
k.sort()

#first attempt, detailed in the answer, too
def search_sorted1(A, k):
    indicesClosest = np.searchsorted(k, A)
    flagToReduce = indicesClosest == k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1

    flagToReduce = np.logical_or(flagToReduce,
                        np.abs(A-k[indicesClosest-1]) <
                        np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1

    return indicesClosest

#taken from @Divakar answer linked in the comments under the question
def search_sorted2(A, k):
    indicesClosest = np.searchsorted(k, A, side="left").clip(max=k.size - 1)
    mask = (indicesClosest > 0) & \
           ((indicesClosest == len(k)) | (np.fabs(A - k[indicesClosest - 1]) < np.fabs(A - k[indicesClosest])))
    indicesClosest = indicesClosest - mask

    return indicesClosest
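
def kdquery1(A, k):
    # cKDTree expects 2-D data of shape (n_points, n_dims), so add a trailing axis
    d = scipy.spatial.cKDTree(k[:, None], compact_nodes=False, balanced_tree=False)
    _, indices = d.query(A[:, None])
    return indices

#After an idea on the scipy mailing list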
class SearchSorted:
    def __init__(self, tensor, use_k_optimization=True):

        '''
        Using this requires storing 4x the size of the tensor.
        If use_k_optimization is True, the class will assume that successive calls will be made with similar k.
        When this happens, we can cut the running time significantly by storing additional variables. If it won't be
        called with successive k, set the flag to False, as otherwise would just consume more memory for no
        good reason
        '''

        self.indices_sort = np.argsort(tensor)
        self.sorted_tensor = tensor[self.indices_sort]
        self.inv_indices_sort = np.argsort(self.indices_sort)
        self.use_k_optimization = use_k_optimization

        self.previous_indices_results = None
        self.prev_idx_A_k_pair = None

    def query(self, k):
        midpoints = k[:-1] + np.diff(k) / 2
        idx_count = np.searchsorted(self.sorted_tensor, midpoints)
        idx_A_k_pair = []
        count = 0

        old_obj = 0
        for obj in idx_count:
            if obj != old_obj:
                idx_A_k_pair.append((obj, count))
                old_obj = obj
            count += 1

        if not self.use_k_optimization or self.previous_indices_results is None:
            #creates the index matrix in the sorted case
            final_indices = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
            #and now unsort it to match the original tensor position
            indicesClosest = final_indices[self.inv_indices_sort]
            if self.use_k_optimization:
                self.prev_idx_A_k_pair = idx_A_k_pair
                self.previous_indices_results = indicesClosest
            return indicesClosest

        old_indices_unsorted = self._create_indices_matrix(self.prev_idx_A_k_pair, self.sorted_tensor.shape, len(k))
        new_indices_unsorted = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
        mask = new_indices_unsorted != old_indices_unsorted

        self.prev_idx_A_k_pair = idx_A_k_pair
        self.previous_indices_results[self.indices_sort[mask]] = new_indices_unsorted[mask]
        indicesClosest = self.previous_indices_results

        return indicesClosest

    @staticmethod
    def _create_indices_matrix(idx_A_k_pair, matrix_shape, len_quant_points):
        old_idx = 0
        final_indices = np.zeros(matrix_shape, dtype=int)
        for idx_A, idx_k in idx_A_k_pair:
            final_indices[old_idx:idx_A] = idx_k
            old_idx = idx_A
        final_indices[old_idx:] = len_quant_points - 1
        return final_indices

mySearchSorted = SearchSorted(A, use_k_optimization=True)
mySearchSorted2 = SearchSorted(A, use_k_optimization=False)
allFunctions = [search_sorted1, search_sorted2,
                mySearchSorted.query,
                mySearchSorted2.query]

# sanity checks: all functions return the index array directly, so compare it as a whole
print(np.array_equal(mySearchSorted.query(k), kdquery1(A, k)))
print(np.array_equal(mySearchSorted.query(k), search_sorted2(A, k)))
print(np.array_equal(mySearchSorted2.query(k), search_sorted2(A, k)))

if __name__== '__main__':
    num_to_average = 3
    for func in allFunctions:
        if func.__name__ == 'search_sorted3':
            indices_sort = np.argsort(A)
            sA = A[indices_sort].copy()
            inv_indices_sort = np.argsort(indices_sort)
        else:
            sA = A.copy()
        if func.__name__ != 'query':
            func_to_use = lambda x: func(sA, x)
        else:
            func_to_use = func
        k_to_use = k
        start_time = time.time()
        for idx_average in range(num_to_average):
            for idx_repeat in range(10):
                k_to_use += (2*np.random.rand(*k.shape)-1)/100 #uniform between (-1/100, 1/100)
                k_to_use.sort()
                indices = func_to_use(k_to_use)
                if func.__name__ == 'search_sorted3':
                    indices = indices[inv_indices_sort]
                val = k[indices]

        end_time = time.time()
        total_time = end_time-start_time

        print('Function {}; {}s'.format(func.__name__, total_time))

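I am sure it can still be done better (I use quite a bit of space for the SearchSorted class, so we could probably save something). If you have any ideas for improvement, please let me know!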

2 solutions

Solution 1
2 2017-10-12 13:26:42

Solution 2
2 accepted 2017-10-13 14:27:23