Find numpy array in other numpy array

Question

I need to find a small numpy array in a much larger numpy array. For example:

import numpy as np
a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

A function

find_numpy_array_in_other_numpy_array(a, b)

should return the indices

[3, 4, 11]

that mark where the complete numpy array a appears in the numpy array b.

There is a brute-force approach to this problem, but it is slow for very large b arrays:

ok = []
for idx in range(b.size - a.size + 1):
    if np.all(a == b[idx : idx + a.size]):
        ok.append(idx)

I am looking for a much faster way to find all indices of the full array a in array b. The fast approach should also allow other comparison functions, e.g. to find the worst-case difference between a and each window of b:

diffs = []
for idx in range(b.size - a.size + 1):
    bi = b[idx : idx + a.size]
    diff = np.nanmax(np.abs(bi - a))
    diffs.append(diff)

Answer 1

Generic solution setup

For a generic solution, we can create a 2D array of sliding windows and then perform the relevant operations on it:

from skimage.util.shape import view_as_windows

b2D = view_as_windows(b,len(a))

NumPy equivalent implementation.
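If scikit-image is not installed, one possible NumPy-only route is sliding_window_view, available in NumPy 1.20 and later. The sketch below assumes that version and reuses a and b from the question; it is not necessarily the implementation the answer linked to.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

# Each row of b2D is one window of length len(a).
# The result is a read-only view into b, so no window data is copied.
b2D = sliding_window_view(b, len(a))
print(b2D.shape)  # (14, 2)
```

Because the result is a view, building it is cheap even for a very large b; memory is only spent once an operation such as b2D == a materializes a full-size result.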

Problem #1

Then, to solve the matching-indices problem, it is simply:

matching_indices = np.flatnonzero((b2D==a).all(axis=1))

Problem #2

The second problem maps over just as easily. Any reduction that produced one output element per window in the loop becomes a reduction along the second axis of the 2D window array, using that function's axis argument:

diffs = np.nanmax(np.abs(b2D-a),axis=1)
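To illustrate how both problems share one setup, here is a minimal end-to-end sketch. The helper name window_reduce is hypothetical (it is not part of NumPy or scikit-image), and the windows are built with NumPy's sliding_window_view (NumPy 1.20+) rather than view_as_windows, purely to keep the example dependency-free.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_reduce(a, b, func):
    """Hypothetical helper: call func(windows, a), where windows has one row per window of b."""
    windows = sliding_window_view(b, len(a))
    return func(windows, a)

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

# Problem #1: start indices of the windows that equal a element-wise.
matches = np.flatnonzero(window_reduce(a, b, lambda w, t: (w == t).all(axis=1)))

# Problem #2: worst-case absolute difference between a and each window.
diffs = window_reduce(a, b, lambda w, t: np.nanmax(np.abs(w - t), axis=1))

print(matches)  # [ 3  4 11]
print(diffs)    # [2 2 2 0 0 7 7 2 5 5 1 0 2 3]
```

Note that expressions such as w == t or w - t allocate a temporary array of shape (len(b) - len(a) + 1, len(a)), so memory use grows with the length of a even though the window view itself is free.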

Answer 2

The following code finds all matches of the first element of your sequence (a) in array b. It then builds a new array whose columns are the possible sequence candidates, compares them to the rest of the sequence, and filters the initial indices accordingly.

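seq, arr = a, b
len_seq = len(seq)
# Indices where the first element of seq matches (possible sequence candidates).
# The explicit stop avoids arr[:-len_seq+1] collapsing to arr[:0] when len_seq == 1.
ini_idx = (arr[:len(arr) - len_seq + 1] == seq[0]).nonzero()[0]
# One column per candidate start, holding the elements that should equal seq[1:]
seq_candidates = arr[np.arange(1, len_seq)[:, None] + ini_idx]
mask = (seq_candidates == seq[1:, None]).all(axis=0)
idx = ini_idx[mask]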
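The same idea can be wrapped in a function for reuse. This is a sketch, not the answer author's code: the name find_subsequence and the handling of empty or over-long sequences are assumptions added for illustration.

```python
import numpy as np

def find_subsequence(seq, arr):
    """Return the start indices where seq occurs in arr (hypothetical wrapper)."""
    seq = np.asarray(seq)
    arr = np.asarray(arr)
    n = len(seq)
    # Assumption: an empty or over-long sequence yields no matches.
    if n == 0 or n > len(arr):
        return np.array([], dtype=np.intp)
    # Candidate starts: positions where the first element matches.
    ini_idx = np.flatnonzero(arr[:len(arr) - n + 1] == seq[0])
    # Column j holds the n - 1 elements following candidate start ini_idx[j].
    candidates = arr[np.arange(1, n)[:, None] + ini_idx]
    mask = (candidates == seq[1:, None]).all(axis=0)
    return ini_idx[mask]

b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
print(find_subsequence([1, 1], b))  # [ 3  4 11]
print(find_subsequence([3], b))     # [ 1  2  7 13]
```

This approach only gathers full windows for positions whose first element already matches, so it tends to use less memory than the full sliding-window comparison when the first element of the sequence is rare in b.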

Answer 3

You can consider using Numba to compile the function. You could do it like this:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def search_in_array(a, b):
    idx = np.empty(len(b) - len(a) + 1, dtype=np.bool_)
    for i in nb.prange(len(idx)):
        idx[i] = np.all(a == b[i:i + len(a)])
    return np.where(idx)[0]

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
print(search_in_array(a, b))
# [ 3  4 11]

A quick benchmark:

import numpy as np

np.random.seed(100)
a = np.random.randint(5, size=10)
b = np.random.randint(5, size=10_000_000)

# Non-compiled function
%timeit search_in_array.py_func(a, b)
# 22.8 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Compiled function
%timeit search_in_array(a, b)
# 54.7 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see, this gives a roughly 400x speedup, and the memory cost is relatively low (one boolean array about the same size as the big array).
