
NumbaPro - Smartest way to sort a 2d array and then sum over entries of same key

In my program I have an array with several million entries, like this:

arr=[(1,0.5), (4,0.2), (321, 0.01), (2, 0.042), (1, 0.01), ...]

I could instead use two arrays in the same order (rather than one array of tuples) if that helps.

To sort this array I know I can use radix sort, giving it this structure:

arr_sorted = [(1, 0.5), (1, 0.01), (2, 0.042), ...]

Now I want to sum all the values that have the key 1, then all that have the key 2, and so on. The result should be written into a new array like this:

arr_summed = [(1, 0.51), (2, 0.042), ...]

Obviously this array would be much smaller, although still on the order of 100,000 entries. Now my question is: what's the best parallel approach to this problem in CUDA? I am using NumbaPro.
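As a CPU reference for the sort-then-sum structure described above, here is a minimal NumPy sketch (my own illustration, not from the question): a stable argsort stands in for the radix sort, segment boundaries are found where the sorted key changes, and `np.add.reduceat` sums each segment.

```python
import numpy as np

# the example data from the question (trailing "..." entries omitted)
arr = [(1, 0.5), (4, 0.2), (321, 0.01), (2, 0.042), (1, 0.01)]
keys = np.array([k for k, _ in arr])
vals = np.array([v for _, v in arr])

order = np.argsort(keys, kind="stable")  # stand-in for the radix sort
ks, vs = keys[order], vals[order]

# segment boundaries: positions where a new key starts in the sorted array
starts = np.flatnonzero(np.r_[True, ks[1:] != ks[:-1]])
summed_keys = ks[starts]                 # one entry per distinct key
summed_vals = np.add.reduceat(vs, starts)  # sum of each key's segment
```

This reproduces `arr_summed` as the pair `(summed_keys, summed_vals)`; a GPU implementation would replace the argsort and segmented reduction with their device-side equivalents.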

Edit for clarity

I would have two arrays instead of a list of tuples like this:

keys = [1, 2, 5, 2, 6, 4, 4, 65, 3215, 1, .....]
values = [0.1, 0.4, 0.123, 0.01, 0.23, 0.1, 0.1, 0.4 ...]

They are initially numpy arrays that get copied to the device.

What I want is to reduce them by key and, if possible, set the values of missing keys (for example, if 3 does not appear in the keys array) to zero.

So I would want it to become:

keys = [1, 2, 3, 4, 5, 6, 7, 8, ...]
values = [0.11, 0.41, 0, 0.2, ...] # <- Summed by key

I know how big the final array will be beforehand.
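Since the final size is known in advance and missing keys should become zero, the desired dense result can be expressed on the CPU with `np.bincount`, which sums `weights` grouped by integer key (slot `i` of the output holds the sum for key `i`, so slot 0 exists even if no key 0 occurs). A minimal sketch using a truncated version of the arrays above:

```python
import numpy as np

keys = np.array([1, 2, 5, 2, 6, 4, 4])          # truncated example keys
values = np.array([0.1, 0.4, 0.123, 0.01, 0.23, 0.1, 0.1])

n_out = 7  # known beforehand: max key + 1
# bincount sums values by key; keys that never occur (here 0 and 3) stay 0.0
summed = np.bincount(keys, weights=values, minlength=n_out)
```

This is handy as a reference to validate a GPU kernel against, even though it runs on the host.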

I don't know Numba, but in simple Python:

arr = [(1, 0.5), (4, 0.2), (321, 0.01), (2, 0.042), (1, 0.01)]  # ... more entries
indexmax = max(k for k, v in arr)  # largest key determines the result size
res = [0.0] * (indexmax + 1)
for k, v in arr:
    res[k] += v
