简体   繁体   中英

find numpy array in other numpy array

I need to find a small numpy array in a much larger numpy array. For example:

import numpy as np
a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

A function

find_numpy_array_in_other_numpy_array(a, b)

should return indices

[3, 4, 11]

that represent where the complete numpy array a appears in the complete numpy array b .

There is a brute force approach to this problem that is slow when dealing with very large b arrays:

ok = []
for idx in range(b.size - a.size + 1):
    if np.all(a == b[idx : idx + a.size]):
        ok.append(idx)

I am looking for a much faster way to find all indices of the full array a in array b . The fast approach should also allow other comparison functions, eg to find the worst case difference between a and b :

diffs = []
for idx in range(b.size - a.size + 1):
    bi = b[idx : idx + a.size]
    diff = np.nanmax(np.abs(bi - a))
    diffs.append(diff)

Generic solution setup

For a generic solution, we can create 2D array of sliding windows and then perform the relevant operations -

from skimage.util.shape import view_as_windows

b2D = view_as_windows(b,len(a))

NumPy equivalent implementation .

Problem #1

Then, to solve for matching indices problem, it's simply -

matching_indices = np.flatnonzero((b2D==a).all(axis=1))

Problem #2

To solve for the second problem, it maps easily by keeping in mind that any ufunc reduction operation to get an output element is to be translated into reduction along the second axis in the proposed solution using that ufunc's axis argument -

diffs = np.nanmax(np.abs(b2D-a),axis=1)

The following code finds all matches of 1st element in your sequence ( a ) in array b . Then it creates a new array with columns of your possible sequence candidates, compares them to the full sequence, and filters the initial indexes

seq, arr = a, b
len_seq = len(seq)    
ini_idx = (arr[:-len_seq+1]==seq[0]).nonzero()[0] # idx of possible sequence canditates   
seq_candidates = arr[np.arange(1, len_seq)[:, None]+ini_idx] # columns with possible seq. candidates   
mask = (seq_candidates==seq[1:,None]).all(axis=0)
idx = ini_idx[mask]

You can consider using Numba to compile the function. You could do it like this:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def search_in_array(a, b):
    idx = np.empty(len(b) - len(a) + 1, dtype=np.bool_)
    for i in nb.prange(len(idx)):
        idx[i] = np.all(a == b[i:i + len(a)])
    return np.where(idx)[0]

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
print(search_in_array(a, b))
# [ 3  4 11]

A quick benchmark:

import numpy as np

np.random.seed(100)
a = np.random.randint(5, size=10)
b = np.random.randint(5, size=10_000_000)

# Non-compiled function
%timeit search_in_array.py_func(a, b)
# 22.8 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Compiled function
%timeit search_in_array(a, b)
# 54.7 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you see, you can get a ~400x speedup and the memory cost is relatively low (a boolean array the same size as the big array).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM