
NumPy sum one array based on values in another array for each matching element in 3rd array

I have two NumPy arrays, one containing values and one containing each value's category.

import numpy as np
values=np.array([1,2,3,4,5,6,7,8,9,10])
valcats=np.array([101,301,201,201,102,302,302,202,102,301])

I have another array containing the unique categories I'd like to sum across.

categories=np.array([101,102,201,202,301,302])

My issue is that I will be running this same summing process a few billion times and every microsecond matters.

My current implementation is as follows.

catsums=[]
for x in categories:
    catsums.append(np.sum(values[np.where(valcats==x)]))

The resulting catsums should be:

[1, 14, 7, 8, 12, 13]

My current run time is about 5 µs. I am still somewhat new to Python and was hoping to find a faster solution, perhaps by combining the first two arrays, using a lambda, or something cool I don't even know about.

Thanks for reading!

@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't have the unique values already defined, I'd use mine.


I'd use pd.factorize to encode the categories as integer codes, then use np.bincount with its weights parameter set to the values array.

import pandas as pd
f, u = pd.factorize(valcats)
np.bincount(f, values).astype(values.dtype)

array([ 1, 12,  7, 14, 13,  8])

pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.

np.column_stack([u, np.bincount(f, values).astype(values.dtype)])

array([[101,   1],
       [301,  12],
       [201,   7],
       [102,  14],
       [302,  13],
       [202,   8]])

You can make this more obvious by using a pd.Series:

f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)

101     1
301    12
201     7
102    14
302    13
202     8
dtype: int64

Why pd.factorize and not np.unique?

We could have done this equivalently with:

 u, f = np.unique(valcats, return_inverse=True)

But np.unique sorts the values, which takes O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger data sets, pd.factorize will be faster.
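For completeness, here is a minimal sketch of the np.unique variant mentioned above, reusing the arrays from the question. Because np.unique returns the unique values sorted, the sums come out in sorted category order rather than in order of first appearance, which happens to match the catsums order the question asks for.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# u: sorted unique categories; f: index into u for each element of valcats
u, f = np.unique(valcats, return_inverse=True)

# bincount with weights returns floats, so cast back to the input dtype
sums = np.bincount(f, weights=values).astype(values.dtype)

# pair each category with its sum
result = np.column_stack([u, sums])
print(result)
```

This needs only NumPy, so it may be the simpler choice when pandas is not already a dependency and the arrays are small.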

You can use searchsorted and bincount:

np.bincount(np.searchsorted(categories, valcats), values)
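A self-contained sketch of that one-liner on the question's data follows. It rests on two assumptions: categories is sorted in ascending order, and every entry of valcats actually appears in categories (otherwise searchsorted returns an insertion point and the value is silently added to the wrong bin). The minlength argument and the final astype cast are additions for illustration, not part of the answer above; minlength keeps the output the same length as categories even if the trailing categories have no matches.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])  # assumed sorted

# position of each valcats entry within the sorted categories array
idx = np.searchsorted(categories, valcats)

# sum values per position; cast because weighted bincount returns floats
catsums = np.bincount(idx, weights=values, minlength=len(categories))
catsums = catsums.astype(values.dtype)
print(catsums)
```

If the cast is not needed, dropping astype saves a copy, which may matter when the call is repeated billions of times.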
