
Efficiently find indices of all values in an array

I have a very large array, consisting of integers between 0 and N, where each value occurs at least once.

I'd like to know, for each value k, all the indices in my array where the array's value equals k.

For example:

arr = np.array([0,1,2,3,2,1,0])
desired_output = {
    0: np.array([0,6]),
    1: np.array([1,5]),
    2: np.array([2,4]),
    3: np.array([3]),
    }

Right now I am accomplishing this with a loop over range(N+1), calling np.where once per value.

indices = {}
for value in range(max(arr)+1):
    indices[value] = np.where(arr == value)[0]

This loop is by far the slowest part of my code. (Both the arr==value evaluation and the np.where call take up significant chunks of time.) Is there a more efficient way to do this?

I also tried playing around with np.unique(arr, return_index=True) but that only tells me the very first index, rather than all of them.

Approach #1

Here's a vectorized approach to get those indices as a list of arrays -

sidx = arr.argsort()
unq, cut_idx = np.unique(arr[sidx], return_index=True)  # first position of each value in the sorted order
indices = np.split(sidx, cut_idx)[1:]                   # [1:] drops the empty piece before the cut at 0

If you want a final dictionary that maps each unique element to its indices, we can use a dictionary comprehension -

dict_out = {unq[i]: iterID for i, iterID in enumerate(indices)}
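
A quick check on the example array from the question (a sketch; kind='stable' is optional, but it keeps the indices within each group in ascending order, which the default quicksort does not guarantee):

import numpy as np

arr = np.array([0, 1, 2, 3, 2, 1, 0])
sidx = arr.argsort(kind='stable')
unq, cut_idx = np.unique(arr[sidx], return_index=True)
indices = np.split(sidx, cut_idx)[1:]
dict_out = {unq[i]: iterID for i, iterID in enumerate(indices)}
# {0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}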

Approach #2

If you are just interested in the list of arrays, here's an alternative geared toward performance -

sidx = arr.argsort()
indices = np.split(sidx, np.flatnonzero(np.diff(arr[sidx]) > 0) + 1)  # split wherever the sorted values step up
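
The np.diff trick finds the group boundaries directly: a new group starts wherever the sorted values step up, so this skips the extra work np.unique does. On the example array (a sketch):

import numpy as np

arr = np.array([0, 1, 2, 3, 2, 1, 0])
sidx = arr.argsort(kind='stable')
cuts = np.flatnonzero(np.diff(arr[sidx]) > 0) + 1  # -> array([2, 4, 6])
indices = np.split(sidx, cuts)
# [array([0, 6]), array([1, 5]), array([2, 4]), array([3])]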

A Pythonic way is to use collections.defaultdict():

>>> from collections import defaultdict
>>> 
>>> d = defaultdict(list)
>>> 
>>> for i, j in enumerate(arr):
...     d[j].append(i)
... 
>>> d
defaultdict(<class 'list'>, {0: [0, 6], 1: [1, 5], 2: [2, 4], 3: [3]})
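
If you need NumPy arrays rather than lists, as in the desired output, a follow-up conversion is cheap (a sketch):

>>> {k: np.array(v) for k, v in d.items()}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}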

And here is a Numpythonic way using a dictionary comprehension and numpy.where():

>>> {i: np.where(arr == i)[0] for i in np.unique(arr)}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}

And here is a pure Numpythonic approach if you don't want to involve a dictionary:

>>> uniq = np.unique(arr)
>>> args, indices = np.where((np.tile(arr, len(uniq)).reshape(len(uniq), len(arr)) == np.vstack(uniq)))
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]
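
Note that np.tile materializes an extra len(uniq) x len(arr) copy of arr before the comparison; plain broadcasting builds the same boolean mask without that intermediate copy (a sketch):

>>> args, indices = np.where(arr == uniq[:, None])
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]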

I don't know numpy, but you could definitely do this in one iteration, with a defaultdict:

from collections import defaultdict

indices = defaultdict(list)
for i, val in enumerate(arr):
    indices[val].append(i)

Fully vectorized solution using the numpy_indexed package:

import numpy_indexed as npi
k, idx = npi.group_by(arr, np.arange(len(arr)))

On a higher level: why do you need these indices? Subsequent grouped operations can usually be computed much more efficiently using the group_by functionality (e.g., npi.group_by(arr).mean(someotherarray)), without explicitly computing the indices of the keys.
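
As a sketch of that idea (assuming numpy_indexed's group_by reduction API; someotherarray is a hypothetical per-element payload, not something from the question):

import numpy as np
import numpy_indexed as npi

arr = np.array([0, 1, 2, 3, 2, 1, 0])
someotherarray = np.arange(len(arr), dtype=float)  # hypothetical values to aggregate per key

# reduce per group directly, without ever materializing the index lists
unique_keys, group_means = npi.group_by(arr).mean(someotherarray)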
