What's the most performant way of removing elements from a NumPy array, based on the number of times they appear in another array?

Let's say I have two NumPy arrays:

a = np.array([1,2,2,3,3,3])
b = np.array([2,2,3])

and I would like to remove all elements of b from a, the same number of times they occur in b, i.e.

diff(a, b)
>>> np.array([1,3,3])

Note that for my use case, b will always be a subset of a, and both may be unordered. However, set-like methods such as numpy.setdiff1d don't cut it, since it's important to remove each element a certain number of times.
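
For example, np.setdiff1d returns only the unique values of a that are not in b, so all multiplicity information is lost:

np.setdiff1d(a, b)
>>> array([1])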

My current, lazy solution looks as follows:

def diff(a, b):
    for el in b:
        # index of the first occurrence of el in a (argmax of the boolean mask)
        idx = (el == a).argmax()
        # argmax returns 0 when el is not present at all, so double-check before deleting
        if a[idx] == el:
            a = np.delete(a, idx)
    return a

But I'm wondering whether there are more performant or more compact, "numpy-esque" ways of writing this.

Here's a vectorized approach based on np.searchsorted -

import numpy as np
import pandas as pd

def diff_v2(a, b):
    # Get sorted orders
    sidx = a.argsort(kind='stable')
    A = a[sidx]
    
    # Get searchsorted indices per sorted order
    idx = np.searchsorted(A,b)
    
    # Get increments
    s = pd.Series(idx)
    inc = s.groupby(s).cumcount().values
    
    # Delete elements at the traced-back positions
    return np.delete(a,sidx[idx+inc])
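
To see what the intermediate arrays look like, here is a quick trace of diff_v2 on the toy arrays from the question:

a = np.array([1,2,2,3,3,3])
b = np.array([2,2,3])

sidx = a.argsort(kind='stable')        # [0 1 2 3 4 5]  (a happens to be sorted already)
A = a[sidx]                            # [1 2 2 3 3 3]
idx = np.searchsorted(A, b)            # [1 1 3]  leftmost slot in A for each element of b
s = pd.Series(idx)
inc = s.groupby(s).cumcount().values   # [0 1 0]  running count within equal idx values
np.delete(a, sidx[idx + inc])          # removes positions [1 2 3] -> array([1, 3, 3])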

Further optimization

Let's resort to NumPy for the groupby cumcount part -

# Perform groupby cumcount on sorted array
def groupby_cumcount(idx):
    # True where an element equals its predecessor; padded with False at both ends
    mask = np.r_[False,idx[:-1]==idx[1:],False]
    # Cumulative number of repeats seen so far, and the size of each group
    ids = mask[:-1].cumsum()
    count = np.diff(np.flatnonzero(~mask))
    # Subtract each group's starting value of ids to get 0,1,2,... within every group
    return ids - np.repeat(ids[~mask[:-1]],count)

def diff_v3(a, b):
    # Get sorted orders
    sidx = a.argsort(kind='stable')
    A = a[sidx]
    
    # Get searchsorted indices per sorted order
    idx = np.searchsorted(A,b)
    
    # Get increments
    idx = np.sort(idx)
    inc = groupby_cumcount(idx)
    
    # Delete elements at the traced-back positions
    return np.delete(a,sidx[idx+inc])
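
As a quick sanity check, the NumPy helper reproduces the pandas cumcount on the sorted indices from the trace above:

groupby_cumcount(np.array([1, 1, 3]))
>>> array([0, 1, 0])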

Benchmarking

Using a setup with 10000 elements for a (with ~2x repetitions per value) and b half the size of a.

In [52]: np.random.seed(0)
    ...: a = np.random.randint(0,5000,10000)
    ...: b = a[np.random.choice(len(a), 5000,replace=False)]

In [53]: %timeit diff(a,b)
    ...: %timeit diff_v2(a,b)
    ...: %timeit diff_v3(a,b)
108 ms ± 821 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.85 ms ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.89 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Next up, on 100000 elements -

In [54]: np.random.seed(0)
    ...: a = np.random.randint(0,50000,100000)
    ...: b = a[np.random.choice(len(a), 50000,replace=False)]

In [55]: %timeit diff(a,b)
    ...: %timeit diff_v2(a,b)
    ...: %timeit diff_v3(a,b)
4.45 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
37.5 ms ± 661 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
28 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For non-negative integers, with sorted output

We can use np.bincount -

def diff_v4(a, b):
    C = np.bincount(a)
    C -= np.bincount(b,minlength=len(C))
    return np.repeat(np.arange(len(C)), C)
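
On the toy arrays from the question, the counting steps look as follows (note that the output comes back sorted):

a = np.array([1,2,2,3,3,3])
b = np.array([2,2,3])

C = np.bincount(a)                      # [0 1 2 3]  counts of the values 0..3 in a
C -= np.bincount(b, minlength=len(C))   # [0 1 0 2]  counts left after removing b
np.repeat(np.arange(len(C)), C)         # array([1, 3, 3])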

Here is an approach that is similar but very slightly faster than @Divakar's (at the time of writing, subject to change...).

import numpy as np

def pp():
    # Note: operates on the module-level arrays a and b (handy for timeit)
    # Sentinel that sorts before every element of a and b
    if a.dtype.kind == "i":
        small = np.iinfo(a.dtype).min
    else:
        small = -np.inf
    # Merge sentinel, b and a; the stable sort puts equal values from b before those from a
    ba = np.concatenate([[small],b,a])
    idx = ba.argsort(kind="stable")
    # Running balance over the merged order: -1 for sentinel/b entries, +1 for a entries
    aux = np.where(idx<=b.size,-1,1)
    aux = aux.cumsum()
    # a entries that keep the balance at its running maximum are not cancelled by a
    # matching b entry; the shifted AND drops the tie produced by the last cancelled copy
    valid = aux==np.maximum.accumulate(aux)
    valid[0] = False
    valid[1:] &= valid[:-1]
    # Map the surviving sorted positions back to ba and keep them in original order
    aux2 = np.zeros(ba.size,bool)
    aux2[idx[valid]] = True
    return ba[aux2.nonzero()]

def groupby_cumcount(idx):
    mask = np.r_[False,idx[:-1]==idx[1:],False]
    ids = mask[:-1].cumsum()
    count = np.diff(np.flatnonzero(~mask))
    return ids - np.repeat(ids[~mask[:-1]],count)

# @Divakar's diff_v3, reproduced without arguments so it can be passed straight to timeit
def diff_v3():
    # Get sorted orders
    sidx = a.argsort(kind='stable')
    A = a[sidx]
    
    # Get searchsorted indices per sorted order
    idx = np.searchsorted(A,b)
    
    # Get increments
    idx = np.sort(idx)
    inc = groupby_cumcount(idx)
    
    # Delete elements at the traced-back positions
    return np.delete(a,sidx[idx+inc])

np.random.seed(0)
a = np.random.randint(0,5000,10000)
b = a[np.random.choice(len(a), 5000,replace=False)]

from timeit import timeit

print(timeit(pp,number=100)*10)        # milliseconds per call
print(timeit(diff_v3,number=100)*10)   # milliseconds per call
print((pp() == diff_v3()).all())

np.random.seed(0)
a = np.random.randint(0,50000,100000)
b = a[np.random.choice(len(a), 50000,replace=False)]

print(timeit(pp,number=10)*100)        # milliseconds per call
print(timeit(diff_v3,number=10)*100)   # milliseconds per call
print((pp() == diff_v3()).all())

Sample run:

1.4644702401710674
1.6345531499246135
True
22.230969095835462
24.67835019924678
True

Update: corresponding timings for @MateenUlhaq's dedup_unique:

7.986748410039581
81.83312350302003

Please note that the results produced by this function are not (at least not trivially) identical to Divakar's and mine.

Your method:

def dedup_reference(a, b):
    for el in b:
        idx = (el == a).argmax()
        if a[idx] == el:
            a = np.delete(a, idx)
    return a

Scan method, which requires sorting the inputs:

def dedup_scan(arr, sel):
    arr.sort()
    sel.sort()
    mask = np.ones_like(arr, dtype=bool)
    sel_idx = 0
    for i, x in enumerate(arr):
        if sel_idx == sel.size:
            break
        if x == sel[sel_idx]:
            mask[i] = False
            sel_idx += 1
    return arr[mask]
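
Note that dedup_scan sorts both inputs in place and returns a sorted result (hence the copies in the benchmark below). On the toy arrays from the question:

dedup_scan(np.array([1,2,2,3,3,3]), np.array([2,2,3]))
>>> array([1, 3, 3])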

np.unique counting method:

def dedup_unique(arr, sel):
    d_arr = dict(zip(*np.unique(arr, return_counts=True)))
    d_sel = dict(zip(*np.unique(sel, return_counts=True)))
    d = {k: v - d_sel.get(k, 0) for k, v in d_arr.items()}
    res = np.empty(sum(d.values()), dtype=arr.dtype)
    idx = 0
    for k, count in d.items():
        res[idx:idx+count] = k
        idx += count
    return res

You could perhaps accomplish the same as above through some clever use of the NumPy set functions (e.g., np.in1d), but I don't think it would be any faster than just using dictionaries.
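
For reference, a fully vectorized variant of the counting idea could look like the sketch below, using np.isin (the modern spelling of np.in1d). The name dedup_unique_np is just for illustration and it is not part of the benchmark; like dedup_unique, it assumes sel is a multiset subset of arr and returns a sorted result:

def dedup_unique_np(arr, sel):
    # Count every value in arr and in sel (np.unique returns sorted values)
    vals, counts = np.unique(arr, return_counts=True)
    sel_vals, sel_counts = np.unique(sel, return_counts=True)
    # Subtract sel's counts; every value of sel is assumed to appear in arr
    counts[np.isin(vals, sel_vals)] -= sel_counts
    # Rebuild the (sorted) result from the remaining counts
    return np.repeat(vals, counts)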


Here is one lazy attempt at benchmarking (updated to include @Divakar's diff_v2 and diff_v3 methods, too):

>>> def timeit_ab(f, n=10):
...     cmd = f"{f}(a.copy(), b.copy())"
...     t = timeit(cmd, globals=globals(), number=n) / n
...     print("{:.4f} {}".format(t, f))

>>> array_copy = lambda x, y: None

>>> funcs = [
...     'array_copy',
...     'dedup_reference',
...     'dedup_scan',
...     'dedup_unique',
...     'diff_v2',
...     'diff_v3',
... ]

>>> def run_test(maxval, an, bn):
...     global a, b
...     a = np.random.randint(maxval, size=an)
...     b = np.random.choice(a, size=bn, replace=False)
...     for f in funcs:
...         timeit_ab(f)

>>> run_test(10**1, 10000, 5000)
0.0000 array_copy
0.0617 dedup_reference
0.0035 dedup_scan
0.0004 dedup_unique     (*)
0.0020 diff_v2
0.0009 diff_v3

>>> run_test(10**2, 10000, 5000)
0.0000 array_copy
0.0643 dedup_reference
0.0037 dedup_scan
0.0007 dedup_unique     (*)
0.0023 diff_v2
0.0013 diff_v3

>>> run_test(10**3, 10000, 5000)
0.0000 array_copy
0.0641 dedup_reference
0.0041 dedup_scan
0.0022 dedup_unique
0.0027 diff_v2
0.0016 diff_v3          (*)

>>> run_test(10**4, 10000, 5000)
0.0000 array_copy
0.0635 dedup_reference
0.0041 dedup_scan
0.0082 dedup_unique
0.0029 diff_v2
0.0015 diff_v3          (*)

>>> run_test(10**5, 10000, 5000)
0.0000 array_copy
0.0635 dedup_reference
0.0041 dedup_scan
0.0118 dedup_unique
0.0031 diff_v2
0.0016 diff_v3          (*)

>>> run_test(10**6, 10000, 5000)
0.0000 array_copy
0.0627 dedup_reference
0.0043 dedup_scan
0.0126 dedup_unique
0.0032 diff_v2
0.0016 diff_v3          (*)

Takeaways:

  • dedup_reference slows down significantly as the number of duplicates increases.
  • dedup_unique is fastest if the range of values is small. diff_v3 is pretty fast and does not depend on the range of values.
  • Array copy times are negligible.
  • Dictionaries are pretty cool.

The performance characteristics strongly depend on both the amount of data (not tested), and also the statistical distributions of the data. I recommend testing the methods with your own data and picking the fastest. Note that the various solutions produce different outputs, and make different assumptions about the inputs.
