I need to find a small numpy array in a much larger numpy array. For example:
import numpy as np
a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
A function
find_numpy_array_in_other_numpy_array(a, b)
should return indices
[3, 4, 11]
that represent where the complete numpy array a appears in the complete numpy array b.
There is a brute force approach to this problem that is slow when dealing with very large b arrays:
ok = []
for idx in range(b.size - a.size + 1):
    if np.all(a == b[idx : idx + a.size]):
        ok.append(idx)
I am looking for a much faster way to find all indices of the full array a in array b. The fast approach should also allow other comparison functions, e.g. to find the worst-case difference between a and each same-length slice of b:
diffs = []
for idx in range(b.size - a.size + 1):
    bi = b[idx : idx + a.size]
    diff = np.nanmax(np.abs(bi - a))
    diffs.append(diff)
For a generic solution, we can create a 2D array of sliding windows and then perform the relevant operations on it:
from skimage.util.shape import view_as_windows
b2D = view_as_windows(b,len(a))
A NumPy-only equivalent implementation is also possible.
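If scikit-image is not available, one NumPy-only way to build the same window array is `numpy.lib.stride_tricks.sliding_window_view`, which exists in NumPy 1.20 and newer. This is a sketch of one option, not necessarily the implementation the answer had in mind; it reuses `a` and `b` from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

# Each row of b2D is one window of length len(a) taken from b.
# The result is a read-only view on b, so no window data is copied here.
b2D = sliding_window_view(b, len(a))
print(b2D.shape)  # (14, 2)
```

Because `b2D` is a view, building it is cheap; memory is only spent once an operation such as `b2D == a` materialises a full-size result.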
Problem #1
To solve the matching-indices problem, compare every window with a and keep the rows where all elements match:
matching_indices = np.flatnonzero((b2D==a).all(axis=1))
Problem #2
The second problem maps over just as easily: any reduction that produces one output element per window becomes a reduction along the second axis of b2D, using that function's axis argument:
diffs = np.nanmax(np.abs(b2D-a),axis=1)
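As a worked illustration, the reduction can be wrapped in a small helper and run on the arrays from the question. The name `window_max_abs_diff` is hypothetical, and the windows are built with NumPy's `sliding_window_view` (NumPy 1.20+) rather than scikit-image, which is an assumption made to keep the sketch dependency-free.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20


def window_max_abs_diff(a, b):
    """Hypothetical helper: worst-case |window - a| for every window of b."""
    b2D = sliding_window_view(b, len(a))
    # b2D - a broadcasts a across the rows and allocates a temporary array
    # of shape (len(b) - len(a) + 1, len(a)).
    return np.nanmax(np.abs(b2D - a), axis=1)


a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

diffs = window_max_abs_diff(a, b)
# For NaN-free data, a worst-case difference of zero means an exact match.
exact = np.flatnonzero(diffs == 0)
print(exact)  # [ 3  4 11]
```

The temporary created by `b2D - a` is the main memory cost of this approach, so for very large `b` and long `a` it may be worth processing `b` in chunks.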
The following code finds all matches of the first element of your sequence (a) in array b. It then builds a new array whose columns are the candidate sequences, compares them to the full sequence, and filters the initial indices accordingly.
seq, arr = a, b
len_seq = len(seq)
# Explicit stop so that len_seq == 1 also works (arr[:-len_seq+1] would be arr[:0])
ini_idx = (arr[:len(arr)-len_seq+1]==seq[0]).nonzero()[0] # idx of possible sequence candidates
seq_candidates = arr[np.arange(1, len_seq)[:, None]+ini_idx] # columns with possible seq. candidates
mask = (seq_candidates==seq[1:,None]).all(axis=0)
idx = ini_idx[mask]
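As a sketch, the same steps can be wrapped in a function so they are easy to test. The name `find_by_first_element` is hypothetical. The function slices the start positions with an explicit stop so that a length-1 sequence works (a negative-stop slice such as `arr[:-len_seq+1]` would be empty when `len_seq` is 1), and it returns an empty result when the sequence is empty or longer than the array.

```python
import numpy as np


def find_by_first_element(seq, arr):
    """Hypothetical wrapper: start indices where seq occurs in arr."""
    seq = np.asarray(seq)
    arr = np.asarray(arr)
    len_seq = len(seq)
    n_starts = len(arr) - len_seq + 1  # number of valid start positions
    if len_seq == 0 or n_starts <= 0:
        return np.array([], dtype=np.intp)
    # Start positions whose first element already matches seq[0].
    ini_idx = np.flatnonzero(arr[:n_starts] == seq[0])
    # Column j holds the remaining elements of candidate j.
    seq_candidates = arr[np.arange(1, len_seq)[:, None] + ini_idx]
    mask = (seq_candidates == seq[1:, None]).all(axis=0)
    return ini_idx[mask]


a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
print(find_by_first_element(a, b))  # [ 3  4 11]
```

This approach only gathers windows whose first element matches, so it tends to help most when `seq[0]` is rare in `arr`; when it is common, the candidate array approaches the size of the full window array.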
You can consider using Numba to compile the function. You could do it like this:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def search_in_array(a, b):
    idx = np.empty(len(b) - len(a) + 1, dtype=np.bool_)
    for i in nb.prange(len(idx)):
        idx[i] = np.all(a == b[i:i + len(a)])
    return np.where(idx)[0]
a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
print(search_in_array(a, b))
# [ 3 4 11]
A quick benchmark:
import numpy as np
np.random.seed(100)
a = np.random.randint(5, size=10)
b = np.random.randint(5, size=10_000_000)
# Non-compiled function
%timeit search_in_array.py_func(a, b)
# 22.8 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Compiled function
%timeit search_in_array(a, b)
# 54.7 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you see, you can get a ~400x speedup and the memory cost is relatively low (a boolean array the same size as the big array).