在 Numpy 数组中查找所有接近数对的最快方法

Question

假设我有一个N = 10随机浮点数的 Numpy 数组：

import numpy as np
np.random.seed(99)
N = 10
arr = np.random.uniform(0., 10., size=(N,))
print(arr)

out[1]: [6.72278559 4.88078399 8.25495174 0.31446388 8.08049963 
         5.6561742 2.97622499 0.46695721 9.90627399 0.06825733]

我想找到所有唯一的数字对，它们彼此之间的差异不超过公差tol = 1. （即绝对差异 <= 1）。 具体来说，我想获取所有唯一的索引对。 每个close pair的索引都应该排序，所有close pair都应该按第一个索引排序。 我设法编写了以下工作代码：

def all_close_pairs(arr, tol=1.):
    res = set()
    for i, x1 in enumerate(arr):
        for j, x2 in enumerate(arr):
            if i == j:
                continue
            if np.isclose(x1, x2, rtol=0., atol=tol):
                res.add(tuple(sorted([i, j])))
    res = np.array(list(res))
    return res[res[:,0].argsort()]

print(all_close_pairs(arr, tol=1.))

out[2]: [[1 5]
         [2 4]
         [3 7]
         [3 9]
         [7 9]]

然而，实际上我有一个N = 1000数字的数组，并且由于嵌套的 for 循环，我的代码变得非常慢。 我相信使用 Numpy 矢量化有更有效的方法。 有谁知道在 Numpy 中最快的方法？

Answer 1

问题是您的代码具有 O(n*n) （二次）复杂度。 为了降低复杂性，您可以尝试先对项目进行排序：

def all_close_pairs(arr, tol=1.):
    res = set()
    arr = sorted(enumerate(arr), key=lambda x: x[1])
    for (idx1, (i, x1)) in enumerate(arr):
        for idx2 in range(idx1-1, -1, -1):
            j, x2 = arr[idx2]
            if not np.isclose(x1, x2, rtol=0., atol=tol):
                break
            indices = sorted([i, j])
            res.add(tuple(indices))
    return np.array(sorted(res))

但是，这仅在您的值范围远大于公差时才有效。

您可以通过使用2 pointers策略进一步改进这一点，但总体复杂性将保持不变。

Answer 2

这是一个纯 numpy 操作的解决方案。 在我的机器上看起来相当快，但我不知道我们正在寻找什么样的速度。

def all_close_pairs(arr, tol=1.):
    N = arr.shape[0]
    # get indices in the array to consider using meshgrid
    pair_coords = np.array(np.meshgrid(np.arange(N), np.arange(N))).T
    # filter out pairs so we get indices in increasing order
    pair_coords = pair_coords[pair_coords[:, :, 0] < pair_coords[:, :, 1]]
    # compare indices in your array for closeness
    is_close = np.isclose(arr[pair_coords[:, 0]], arr[pair_coords[:, 1]], rtol=0, atol=tol)
    return pair_coords[is_close, :]

Answer 3

一种有效的解决方案是首先使用index = np.argsort()对输入值进行排序。 然后，您可以使用arr[index]生成排序数组，然后如果快速连续数组上的对数较少，则在准线性时间内迭代接近的值。 如果对的数量很大，那么由于生成的对数是二次的，所以复杂度是二次的。 得到的复杂度是： O(n log n + m)其中n是输入数组的大小， m是产生的对数。

要找到彼此接近的值，一种有效的方法是使用Numba迭代值。 事实上，虽然在 Numpy 中可能是可能的，但由于要比较的值的数量可变，它可能效率不高。 这是一个实现：

import numba as nb

@nb.njit('int32[:,::1](float64[::1], float64)')
def findCloseValues(arr, tol):
    res = []
    for i in range(arr.size):
        val = arr[i]
        # Iterate over the close numbers (only once)
        for j in range(i+1, arr.size):
            # Sadly neither np.isclose or np.abs are implemented in Numba so far
            if max(val, arr[j]) - min(val, arr[j]) >= tol:
                break
            res.append((i, j))
    if len(res) == 0: # No pairs: we need to help Numpy to know the shape
        return np.empty((0, 2), dtype=np.int32)
    return np.array(res, dtype=np.int32)

最后，需要更新索引以引用未排序数组中的索引而不是排序数组。 您可以使用index[result]来做到这一点。

这是生成的代码：

index = arr.argsort()
result = findCloseValues(arr[index], 1.0)
print(index[result])

这是结果（顺序与问题中的不同，但您可以根据需要对其进行排序）：

array([[9, 3],
       [9, 7],
       [3, 7],
       [1, 5],
       [4, 2]])

提高算法的复杂度

如果您需要更快的算法，那么您可以使用另一种 output 格式：您可以为每个输入值提供接近目标输入值的最小值/最大值范围。 要查找范围，您可以在已排序的数组上使用二进制搜索（请参阅： np.searchsorted ）。 生成的算法在O(n log n)中运行。 但是，您无法获取未排序数组中的索引，因为该范围是不连续的。

Answer 4

您可以首先使用 itertools.combinations 创建组合：

def all_close_pairs(arr, tolerance):
    pairs = list(combinations(arr, 2))
    indexes = list(combinations(range(len(arr)), 2))
    all_close_pairs_indexes = [indexes[i] for i,pair in enumerate(pairs) if abs(pair[0] - pair[1]) <=  tolerance]
    return all_close_pairs_indexes

现在，对于 N=1000，您将只需要比较 499500 对而不是 100 万对！

它比原始代码快约 900 倍。 对于 N=1000，在我的机器上大约需要 0.15 秒。

Answer 5

有点晚了，但所有 numpy 解决方案：

import numpy as np

def close_enough( arr, tol = 1 ): 
    result = np.where( np.triu(np.isclose( arr[ :, None ], arr[ None, : ], r_tol = 0.0, atol = tol ), 1)) 
    return np.swapaxes( result, 0, 1 )

展开以解释正在发生的事情

def close_enough( arr, tol = 1 ):
    bool_arr = np.isclose( arr[ :, None ], arr[ None, : ], rtol = 0.0, atol = tol )
    # is_close generates a square array after comparing all elements with all elements.  

    bool_arr = np.triu( bool_arr, 1 ) 
    # Keep the upper right triangle, offset by 1 column. i.e. zero the main diagonal 
    # and all elements below and to the left.

    result = np.where( bool_arr )  # Return the row and column indices for Trues
    return np.swapaxes( result, 0, 1 ) # Return the pairs in rows rather than columns

N = 1000，arr = 浮点数数组

%timeit close_enough( arr, tol = 1 )                                                                              
14.1 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [19]: %timeit all_close_pairs( arr, tol = 1 )                                                                           
54.3 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(close_enough( arr, tol = 1) == all_close_pairs( arr, tol = 1 )).all()                                            
# True

在 Numpy 数组中查找所有接近数对的最快方法

问题描述

5 个解决方案

解决方案1
1 2021-11-28 16:44:49

解决方案2
1 2021-11-28 17:19:29

解决方案3
1 2021-11-28 17:31:03

提高算法的复杂度

解决方案4
0 2021-11-28 17:53:44

解决方案5
0 2021-11-28 19:40:04

在 Numpy 数组中查找所有接近数对的最快方法

问题描述

5 个解决方案

解决方案1 1 2021-11-28 16:44:49

解决方案2 1 2021-11-28 17:19:29

解决方案3 1 2021-11-28 17:31:03

提高算法的复杂度

解决方案4 0 2021-11-28 17:53:44

解决方案5 0 2021-11-28 19:40:04

解决方案1
1 2021-11-28 16:44:49

解决方案2
1 2021-11-28 17:19:29

解决方案3
1 2021-11-28 17:31:03

解决方案4
0 2021-11-28 17:53:44

解决方案5
0 2021-11-28 19:40:04