I have a very large array, consisting of integers between 0 and N, where each value occurs at least once.
I'd like to know, for each value k , all the indices in my array where the array's value equals k .
For example:
arr = np.array([0,1,2,3,2,1,0])
desired_output = {
0: np.array([0,6]),
1: np.array([1,5]),
2: np.array([2,4]),
3: np.array([3]),
}
Right now I am accomplishing this with a loop over range(N+1)
, and calling np.where
N times.
indices = {}
for value in range(max(arr)+1):
indices[value] = np.where(arr == value)[0]
This loop is by far the slowest part of my code. (Both the arr==value
evaluation and the np.where
call take up significant chunks of time.) Is there a more efficient way to do this?
I also tried playing around with np.unique(arr, return_index=True)
but that only tells me the very first index, rather than all of them.
Approach #1
Here's a vectorized approach to get those indices as list of arrays -
sidx = arr.argsort()
unq, cut_idx = np.unique(arr[sidx],return_index=True)
indices = np.split(sidx,cut_idx)[1:]
If you want the final dictionary that corresponds each unique element to their indices, finally we can use a loop-comprehension -
dict_out = {unq[i]:iterID for i,iterID in enumerate(indices)}
Approach #2
If you are just interested in the list of arrays, here's an alternative meant for performance -
sidx = arr.argsort()
indices = np.split(sidx,np.flatnonzero(np.diff(arr[sidx])>0)+1)
A pythonic way is using collections.defaultdict()
:
>>> from collections import defaultdict
>>>
>>> d = defaultdict(list)
>>>
>>> for i, j in enumerate(arr):
... d[j].append(i)
...
>>> d
defaultdict(<type 'list'>, {0: [0, 6], 1: [1, 5], 2: [2, 4], 3: [3]})
And here is a Numpythonic way using a dictionary comprehension and numpy.where()
:
>>> {i: np.where(arr == i)[0] for i in np.unique(arr)}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}
And here is a pure Numpythonic approach if you don't want to involve the dictionary:
>>> uniq = np.unique(arr)
>>> args, indices = np.where((np.tile(arr, len(uniq)).reshape(len(uniq), len(arr)) == np.vstack(uniq)))
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]
I don't know numpy but you could definitely do this in one iteration, with a defaultdict:
indices = defaultdict(list)
for i, val in enumerate(arr):
indices[val].append(i)
Fully vectorized solution using the numpy_indexed package:
import numpy_indexed as npi
k, idx = npi.groupy_by(arr, np.arange(len(arr)))
On a higher level; why do you need these indices? Subsequent grouped-operations can usually be computed much more efficiently using the group_by functionality [eg, npi.group_by(arr).mean(someotherarray)], without explicitly computing the indices of the keys.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.