I have two numpy arrays, one containing values and one containing each value's category.
values=np.array([1,2,3,4,5,6,7,8,9,10])
valcats=np.array([101,301,201,201,102,302,302,202,102,301])
I have another array containing the unique categories I'd like to sum across.
categories=np.array([101,102,201,202,301,302])
My issue is that I will be running this same summing process a few billion times and every microsecond matters.
My current implementation is as follows.
catsums=[]
for x in categories:
    catsums.append(np.sum(values[np.where(valcats==x)]))
The resulting catsums should be:
[1, 14, 7, 8, 12, 13]
My current run time is about 5 µs. I am still somewhat new to Python and was hoping to find a faster solution, potentially by combining the first two arrays, using a lambda, or something cool I don't even know about.
Thanks for reading!
@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't have the unique values already defined, I'd use mine.
I'd use pd.factorize to factorize the categories. Then use np.bincount with the weights parameter set to the values array:
f, u = pd.factorize(valcats)
np.bincount(f, values).astype(values.dtype)
array([ 1, 12, 7, 14, 13, 8])
pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.
np.column_stack([u, np.bincount(f, values).astype(values.dtype)])
array([[101, 1],
[301, 12],
[201, 7],
[102, 14],
[302, 13],
[202, 8]])
You can make this more obvious using a pd.Series
f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)
101 1
301 12
201 7
102 14
302 13
202 8
dtype: int64
Why pd.factorize and not np.unique?
We could have done this equivalently with
u, f = np.unique(valcats, return_inverse=True)
But np.unique sorts the values, which takes O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger data sets, pd.factorize will be faster.
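For comparison, here is a minimal sketch of the np.unique variant mentioned above, using the arrays from the question. The name sums is only illustrative. Because np.unique returns the unique categories in sorted order, the sums come out in sorted-category order rather than in order of first appearance.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# u holds the sorted unique categories; f maps each element of valcats
# to the index of its category in u
u, f = np.unique(valcats, return_inverse=True)

# bincount with weights returns floats, so cast back to the input dtype
sums = np.bincount(f, weights=values).astype(values.dtype)

print(u)     # sorted unique categories
print(sums)  # one sum per entry of u
```

If the sorted order is what you want anyway, as in the question, this may be good enough; the sort cost only starts to matter as valcats grows.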
You can use searchsorted and bincount:
np.bincount(np.searchsorted(categories, valcats), values)
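A self-contained sketch of the one-liner above, using the question's arrays. It assumes categories is sorted in ascending order (np.searchsorted requires that) and that every entry of valcats actually occurs in categories; otherwise the indices will silently point at the wrong bin. The minlength argument and the cast back to integers are my additions, not part of the original answer.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])  # assumed sorted ascending

# Position of each value's category within the sorted categories array
idx = np.searchsorted(categories, valcats)

# Sum the values per position; minlength keeps one slot per category even if
# the last categories have no values (an addition to the original one-liner)
catsums = np.bincount(idx, weights=values, minlength=len(categories))

# bincount with weights returns floats; cast back if integer sums are needed
catsums_int = catsums.astype(values.dtype)

print(catsums_int)
```

The result lines up with categories element for element, which matches the catsums order asked for in the question.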