Find numpy vectors in a set quickly

I have a numpy array, for example:

a = np.array([[1,2],
              [3,4],
              [6,4],
              [5,3],
              [3,5]])

and I also have a set

b = {(1,2),(6,4),(9,9)}

I want to find the indices of the vectors in a that exist in set b, which here are

[0, 2]

I currently use a for loop to implement this. Is there a convenient way to do this job without a for loop? The for loop method I used:

record = []
for i in range(a.shape[0]):
    if (a[i, 0], a[i, 1]) in b:
        record.append(i)

You can use filter:

In [8]: a = np.array([[1,2],
              [3,4],
              [6,4],
              [5,3],
              [3,5]])

In [9]: b = {(1,2),(6,4)}

In [10]: list(filter(lambda x: tuple(a[x]) in b, range(len(a))))
Out[10]: [0, 2]

First off, convert the set to a NumPy array -

b_arr = np.array(list(b))

Then, based on this post, you would have three approaches. Let's use the second approach for efficiency -

dims = np.maximum(a.max(0),b_arr.max(0)) + 1
a1D = np.ravel_multi_index(a.T,dims)
b1D = np.ravel_multi_index(b_arr.T,dims)    
out = np.flatnonzero(np.in1d(a1D,b1D))

Sample run -

In [89]: a
Out[89]: 
array([[1, 2],
       [3, 4],
       [6, 4],
       [5, 3],
       [3, 5]])

In [90]: b
Out[90]: {(1, 2), (6, 4), (9, 9)}

In [91]: b_arr = np.array(list(b))

In [92]: dims = np.maximum(a.max(0),b_arr.max(0)) + 1
    ...: a1D = np.ravel_multi_index(a.T,dims)
    ...: b1D = np.ravel_multi_index(b_arr.T,dims)    
    ...: out = np.flatnonzero(np.in1d(a1D,b1D))
    ...: 

In [93]: out
Out[93]: array([0, 2])
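
The ravel_multi_index steps above can be wrapped in a small helper, shown here as a minimal sketch. The function name rows_in_set is hypothetical, and the sketch assumes non-negative integer entries and a non-empty b, since np.ravel_multi_index only accepts indices that fit inside the grid given by dims. It uses np.isin in place of np.in1d, which newer NumPy releases deprecate.

```python
import numpy as np


def rows_in_set(a, b):
    """Return indices of rows of `a` that appear in the set of tuples `b`.

    Assumes non-negative integer entries and a non-empty `b`.
    """
    b_arr = np.array(list(b))
    # Grid size per column, large enough to hold every value in a and b.
    dims = tuple(np.maximum(a.max(axis=0), b_arr.max(axis=0)) + 1)
    # Map each (x, y) row to the scalar x * dims[1] + y, unique within the grid.
    a1d = np.ravel_multi_index(tuple(a.T), dims)
    b1d = np.ravel_multi_index(tuple(b_arr.T), dims)
    # Membership test on the 1D codes, then convert the mask to indices.
    return np.flatnonzero(np.isin(a1d, b1d))


a = np.array([[1, 2], [3, 4], [6, 4], [5, 3], [3, 5]])
b = {(1, 2), (6, 4), (9, 9)}
idx = rows_in_set(a, b)
print(idx)  # [0 2]
```

Because every row is reduced to a single integer, the comparison itself runs on plain 1D integer arrays.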

For reference, a straightforward list comprehension (loop) answer:

In [108]: [i for i,v in enumerate(a) if tuple(v) in b]
Out[108]: [0, 2]

It runs at basically the same speed as the filter approach:

In [111]: timeit [i for i,v in enumerate(a) if tuple(v) in b]
10000 loops, best of 3: 24.5 µs per loop

In [114]: timeit list(filter(lambda x: tuple(a[x]) in b, range(len(a))))
10000 loops, best of 3: 29.7 µs per loop

But this is a toy example, so timings aren't meaningful.

If a wasn't already an array, these list approaches would be faster than the array ones, due to the overhead of creating arrays.
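
To get timings that say more than the toy example does, the comparison can be repeated on larger inputs. The following is only a sketch: the names a_big, b_big, loop_version and ravel_version are hypothetical, the sizes are arbitrary, and the numbers it prints depend on the machine and the NumPy version.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# Arbitrary test sizes: 10,000 candidate rows, up to 500 distinct target tuples.
a_big = rng.integers(0, 100, size=(10_000, 2))
b_big = {tuple(row) for row in rng.integers(0, 100, size=(500, 2)).tolist()}


def loop_version():
    # Pure Python membership test, one tuple per row.
    return [i for i, v in enumerate(a_big) if tuple(v) in b_big]


def ravel_version():
    # Encode each row as one integer, then test membership on 1D arrays.
    b_arr = np.array(list(b_big))
    dims = tuple(np.maximum(a_big.max(axis=0), b_arr.max(axis=0)) + 1)
    a1d = np.ravel_multi_index(tuple(a_big.T), dims)
    b1d = np.ravel_multi_index(tuple(b_arr.T), dims)
    return np.flatnonzero(np.isin(a1d, b1d))


loop_time = timeit.timeit(loop_version, number=5)
ravel_time = timeit.timeit(ravel_version, number=5)
print(f"loop:  {loop_time / 5 * 1e3:.2f} ms per call")
print(f"ravel: {ravel_time / 5 * 1e3:.2f} ms per call")
```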

There are some numpy set operations, but they work with 1d arrays. We can get around that by converting the 2d arrays to 1d structured arrays.

In [117]: a.view('i,i')
Out[117]: 
array([[(1, 2)],
       [(3, 4)],
       [(6, 4)],
       [(5, 3)],
       [(3, 5)]], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])
In [119]: np.array(list(b),'i,i')
Out[119]: 
array([(1, 2), (6, 4), (9, 9)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])

There is a version of this using np.void, but it's easier to remember and play with this 'i,i' dtype.

So this works:

In [123]: np.nonzero(np.in1d(a.view('i,i'),np.array(list(b),'i,i')))[0]
Out[123]: array([0, 2], dtype=int32)

but it is much slower than the iterations:

In [124]: timeit np.nonzero(np.in1d(a.view('i,i'),np.array(list(b),'i,i')))[0]
10000 loops, best of 3: 153 µs per loop
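
One caveat with a.view('i,i'): it assumes a holds 32-bit integers. Where NumPy's default integer is 64-bit, the same view would reinterpret each value as a pair of 32-bit fields instead of pairing up the two columns. A sketch that builds the structured dtype from a.dtype avoids this; the names row_dtype, a_struct and b_struct are hypothetical, and np.isin stands in for np.in1d.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [6, 4], [5, 3], [3, 5]])
b = {(1, 2), (6, 4), (9, 9)}

# One field per column, each with the same item type as a.
row_dtype = np.dtype([("f0", a.dtype), ("f1", a.dtype)])

# Reinterpret each 2-element row as one structured item; shape (5, 1) -> (5,).
a_struct = np.ascontiguousarray(a).view(row_dtype).ravel()
# Build the lookup values directly as structured items from the tuples.
b_struct = np.array(list(b), dtype=row_dtype)

idx = np.flatnonzero(np.isin(a_struct, b_struct))
print(idx)  # [0 2]
```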

As discussed in other recent union questions, np.in1d uses several strategies. One is based on broadcasting and where. The other uses unique, concatenation, sorting and differences.

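A broadcasting solution (yes, it's messy), but faster than in1d. Note that it compares individual elements rather than whole rows, so although it returns [0, 2] for this sample, it can report false matches in general (for example, a row [1, 4] would be flagged because 1 and 4 each occur somewhere in b).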

In [150]: timeit np.nonzero((a[:,:,None,None]==np.array(list(b))[:,:]).any(axis=-1).any(axis=-1).all(axis=-1))[0]
10000 loops, best of 3: 52.2 µs per loop
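
A row-wise variant of the broadcasting idea, as a sketch: compare every row of a with every row of b, require all columns to agree, then ask whether any row of b matched. The names eq, row_mask and probe are hypothetical. The probe row [1, 4] shows the difference from an element-by-element comparison, since its values occur in b but never together as a row.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [6, 4], [5, 3], [3, 5]])
b = {(1, 2), (6, 4), (9, 9)}
b_arr = np.array(list(b))

# Shape (len(a), len(b), 2): every row of a against every row of b.
eq = a[:, None, :] == b_arr[None, :, :]
# A row of a matches when all of its columns agree with some row of b.
row_mask = eq.all(axis=-1).any(axis=-1)
idx = np.flatnonzero(row_mask)
print(idx)  # [0 2]

# Probe row whose elements occur in b, but not together as one row.
probe = np.array([[1, 4]])
elementwise = (probe[:, :, None, None] == b_arr).any(axis=-1).any(axis=-1).all(axis=-1)
rowwise = (probe[:, None, :] == b_arr[None, :, :]).all(axis=-1).any(axis=-1)
print(elementwise.tolist(), rowwise.tolist())  # [True] [False]
```

The intermediate boolean array has len(a) * len(b) * 2 entries, so this form suits small b better than large ones.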

A one line solution using a list comprehension:

In [62]: a = np.array([[1,2],
    ...:               [3,4],
    ...:               [6,4],
    ...:               [5,3],
    ...:               [3,5]])

In [63]: b = set(((1,2),(6,4),(9,9)))
In [64]: np.where([tuple(e) in b for e in a])[0]
Out[64]: array([0, 2])
