Count Overlap Between Neighboring Indices in NumPy Array

I have a NumPy array of integers:

x = np.array([1, 0, 2, 1, 4, 1, 4, 1, 0, 1, 4, 3, 0, 1, 0, 2, 1, 4, 3, 1, 4, 1, 0])

and another array of indices that references the array above:

indices = np.array([22, 12, 8, 1, 14, 21, 7, 0, 13, 19, 5, 3, 9, 16, 2, 15, 11, 18, 20, 6, 4, 10, 17])

For every pair of neighboring indices, I need to count how many consecutive values in x match when starting at each of the two indices. For example, indices[2] and indices[3] are 8 and 1, respectively, and both reference positions in x. Starting at x[8] and x[1], we count how many consecutive values are equal, and we stop checking under the specific conditions listed below. In other words, with i and j as the two running positions in x, we check:

  1. x[8] == x[1]
  2. x[9] == x[2] # increment each index by one
  3. ... # continue incrementing each index except in the following conditions
  4. stop if i >= x.shape[0]
  5. stop if j >= x.shape[0]
  6. stop if x[i] == 0
  7. stop if x[j] == 0
  8. stop if x[i] != x[j]

In reality, we do this for all neighboring index pairs:

out = np.zeros(indices.shape[0], dtype=int)
for idx in range(indices.shape[0]-1):
    i = indices[idx]
    j = indices[idx + 1]
    k = 0  # number of consecutive matching values found so far
    # Variant that also stops at zeros:
    # while i+k < x.shape[0] and j+k < x.shape[0] and x[i+k] != 0 and x[j+k] != 0 and x[i+k] == x[j+k]:
    while i+k < x.shape[0] and j+k < x.shape[0] and x[i+k] == x[j+k]:
        k += 1
    out[idx] = k

And the output is:

# [0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 2, 3, 0, 3, 0, 1, 0, 2, 2, 1, 2, 0]  # Old output, produced when the x[i] == 0 and x[j] == 0 stop conditions are included

[1 2 1 4 0 2 2 5 1 4 3 2 3 0 3 0 1 0 3 2 1 2 0]

I'm looking for a vectorized way to do this in NumPy.

This should do the trick (I am ignoring the two stop conditions x[i] == 0 and x[j] == 0):

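out = np.zeros(indices.shape[0], dtype=int)
for idx in range(indices.shape[0]-1):

    i = indices[idx]
    j = indices[idx + 1]

    # Longest stretch that can be compared without running past the end of x
    l = len(x) - max(i,j)
    x1 = x[i:i+l]
    x2 = x[j:j+l]

    # Add False at the end to handle the case in which arrays are exactly the same
    x0 = np.append(x1==x2, False)

    # Index of the first False, i.e. the number of leading matches
    out[idx] = np.argmin(x0)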

Notice that with np.argmin I am exploiting the following two facts:

  • False < True
  • np.argmin only returns the first instance of the min in the array
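To make the two facts above concrete, here is a minimal sketch of the trick in isolation. The helper name leading_true_count is made up for illustration and is not part of the answer's code.

```python
import numpy as np

def leading_true_count(mask):
    """Hypothetical helper: count the True values before the first False."""
    # The appended False is a sentinel, so an all-True mask still contains
    # a minimum at index len(mask) instead of at index 0.
    padded = np.append(mask, False)
    # False < True, and argmin returns the first occurrence of the minimum,
    # so this is the position of the first False.
    return int(np.argmin(padded))

print(leading_true_count(np.array([True, True, False, True])))  # 2
print(leading_true_count(np.array([True, True, True])))         # 3
print(leading_true_count(np.array([False, True])))              # 0

# Without the sentinel, an all-True mask would wrongly report 0.
print(int(np.argmin(np.array([True, True, True]))))             # 0
```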

Performance Analysis

Regarding time performance, I tested with N=10**5 and N=10**6. As suggested in the comments, this cannot compete with a Numba JIT-compiled loop.

import numpy as np

def f(x, indices):

    out = np.zeros(indices.shape[0], dtype=int)

    for idx in range(indices.shape[0]-1):

        i = indices[idx]
        j = indices[idx + 1]

        l = len(x) - max(i,j)
        x1 = x[i:i+l]
        x2 = x[j:j+l]

        x0 = np.append(x1==x2, False)

        out[idx] = np.argmin(x0)

    return out

N=100_000
x = np.random.randint(0,10, N)
indices = np.arange(0, N)
np.random.shuffle(indices)

%timeit f(x, indices)
3.67 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

N=1_000_000
x = np.random.randint(0,10, N)
indices = np.arange(0, N)
np.random.shuffle(indices)

%time f(x, indices)
Wall time: 8min 20s

(I did not have the patience to let %timeit finish)
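One possible way to avoid the Python loop over pairs, offered only as a sketch and not as part of the answer above: advance all neighboring pairs together, one offset at a time, and drop pairs as soon as they stop matching. The loop then runs once per offset (up to the longest matching run) rather than once per pair. The function name count_overlap_batched and the stop_at_zero flag are assumptions introduced here for illustration, and this version has not been timed against f.

```python
import numpy as np

def count_overlap_batched(x, indices, stop_at_zero=False):
    """Hypothetical helper: compare all neighboring pairs one offset at a time."""
    n = x.shape[0]
    out = np.zeros(indices.shape[0], dtype=int)
    i = indices[:-1]
    j = indices[1:]
    # Positions in `out` of the pairs that are still matching.
    active = np.arange(i.shape[0])
    k = 0
    while active.size:
        ii = i[active] + k
        jj = j[active] + k
        # Drop pairs that have run past the end of x.
        in_bounds = (ii < n) & (jj < n)
        active, ii, jj = active[in_bounds], ii[in_bounds], jj[in_bounds]
        # Keep only the pairs whose values still agree at offset k.
        keep = x[ii] == x[jj]
        if stop_at_zero:
            # If the values are equal, checking one side for zero covers both.
            keep &= x[ii] != 0
        active = active[keep]
        out[active] += 1
        k += 1
    return out

x = np.array([1, 0, 2, 1, 4, 1, 4, 1, 0, 1, 4, 3, 0, 1, 0, 2, 1, 4, 3, 1, 4, 1, 0])
indices = np.array([22, 12, 8, 1, 14, 21, 7, 0, 13, 19, 5, 3, 9, 16, 2, 15, 11, 18, 20, 6, 4, 10, 17])

print(count_overlap_batched(x, indices))
# [1 2 1 4 0 2 2 5 1 4 3 2 3 0 3 0 1 0 3 2 1 2 0]
print(count_overlap_batched(x, indices, stop_at_zero=True))
# [0 0 0 0 0 1 1 1 1 3 3 2 3 0 3 0 1 0 2 2 1 2 0]
```

With the default stop_at_zero=False this reproduces the output from the question, and stop_at_zero=True reproduces the older output that includes the two zero conditions. The number of passes depends on the longest matching run, so data with very long repeated stretches would reduce the benefit.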
