I have two 2D numpy arrays with the same shape:
idx = np.array([[1, 2, 5, 6],[1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])
I know that we can use np.bincount, setting val as weights:
np.bincount(idx.reshape(-1), weights=val.reshape(-1))
But this is not exactly what I want: np.bincount puts zeros where the indexes do not exist. In the example, the result is:
array([0. , 0.2, 0.7, 0. , 0. , 1.1, 0.2])
But I do not want these extra zeros for the nonexistent indexes. I want the weighted counts corresponding to np.unique(idx):
array([1, 2, 3, 5, 6])
And my expected results are:
array([0.2, 0.7, 0., 1.1, 0.2])
Does anyone have an idea how to do this efficiently? My idx and val are very large, with more than 1 million elements.
You can use numpy library effectively.
Check this out:
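output = []
for i in np.unique(idx):
    # Boolean mask of the positions where idx equals i
    wt = (idx == i)
    if i == 0:
        # idx is 0 at the matches, so use idx + 1 to keep the mask from vanishing
        zeros = wt*(idx+1)
        l = np.sum(zeros*val)
    else:
        # wt*idx equals i at the matches, so divide the sum by i afterwards
        zeros = wt*idx
        l = np.sum(zeros*val)/i
    output.append(l)
print(output)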
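This works, though it scans the whole array once per unique index, so it slows down as the number of distinct indexes grows. I hope it helps.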
As you may know, for loops in Python are not a good idea for efficiency.
You can try indexing the output of the bincount with the np.unique method:
>>> np.bincount(idx.reshape(-1), val.reshape(-1))[np.unique(idx)]
array([0.2, 0.7, 0. , 1.1, 0.2])
If you just want to get rid of the zeros, this is probably the fastest way.
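One detail worth checking is that selecting by the indexes that occur is not the same as dropping zero-valued sums. A minimal sketch on the sample data from the question (the names sums, by_index and by_value are only illustrative):

```python
import numpy as np

idx = np.array([[1, 2, 5, 6], [1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0.0, 0.8, 0.2]])

# Weighted sums for every integer from 0 to idx.max()
sums = np.bincount(idx.ravel(), weights=val.ravel())

# Selecting by the indexes that occur keeps index 3, whose weights sum to 0
by_index = sums[np.unique(idx)]

# Filtering on the value instead would also discard index 3
by_value = sums[sums != 0]

print(by_index.size, by_value.size)
```

Here by_index has one entry per unique index, while by_value is one entry short, because index 3 is present in idx but its only weight is 0.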
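The key to success is to map each unique index to a consecutive integer starting from 0, and then call np.bincount on the mapped indexes.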
The code to do it (quite concise and without any loop) is:
import pandas as pd
unq = np.unique(idx)
mapper = pd.Series(range(unq.size), index=unq)
np.bincount(mapper[idx.reshape(-1)], weights=val.reshape(-1))
For your sample data, the result is:
array([0.2, 0.7, 0. , 1.1, 0.2])
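If you would rather not depend on pandas, the same mapping can probably be built with np.searchsorted. This is a different technique from the Series lookup above, shown only as a sketch; it relies on the fact that np.unique returns a sorted array and that every element of idx is present in it (flat_idx and pos are illustrative names):

```python
import numpy as np

idx = np.array([[1, 2, 5, 6], [1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0.0, 0.8, 0.2]])

flat_idx = idx.ravel()
unq = np.unique(flat_idx)  # sorted unique indexes

# Each element of flat_idx occurs in unq, so searchsorted returns its exact position
pos = np.searchsorted(unq, flat_idx)

# Sum the weights per position; minlength keeps one slot per unique index
result = np.bincount(pos, weights=val.ravel(), minlength=unq.size)
print(result)
```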
Method 1:
Use np.unique with return_inverse=True.
idx = np.array([[1, 2, 5, 6],[1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])
# Flatten idx first so that inv is 1-D, as np.bincount requires
unq, inv = np.unique(idx.reshape(-1), return_inverse=True)
np.bincount(inv, val.reshape(-1))
# array([0.2, 0.7, 0. , 1.1, 0.2])
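Method 1 could be wrapped in a small helper along these lines. The function name weighted_unique_sums is made up for illustration, and the sketch assumes idx and val have the same shape. Because np.bincount only ever sees the inverse positions, this variant should also cope with negative indexes, which np.bincount rejects when given them directly.

```python
import numpy as np

def weighted_unique_sums(idx, val):
    """Return (unique indexes, sum of val for each unique index).

    Hypothetical helper; assumes idx and val have the same shape.
    """
    idx = np.asarray(idx).ravel()
    val = np.asarray(val, dtype=float).ravel()
    # inv[k] is the position of idx[k] within the sorted unique values
    unq, inv = np.unique(idx, return_inverse=True)
    sums = np.bincount(inv, weights=val, minlength=unq.size)
    return unq, sums

# Negative indexes are fine here because only inv is passed to bincount
unq, sums = weighted_unique_sums([-3, 10, -3, 7], [1.0, 2.0, 3.0, 4.0])
print(unq, sums)
```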
Method 2:
Use bincount, then keep only the positions whose unweighted count is nonzero, i.e. the indexes that actually occur.
np.bincount(idx.reshape(-1),val.reshape(-1))[np.bincount(idx.reshape(-1)).nonzero()]
# array([0.2, 0.7, 0. , 1.1, 0.2])
Which is better will depend on how spread out idx is.
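To make the point about spread concrete, here is a rough sketch with made-up data containing a single large index. Calling np.bincount on the raw indexes allocates one slot per integer up to the maximum, while the return_inverse route allocates one slot per unique index; which of the two is faster on real data is something you would need to time yourself.

```python
import numpy as np

# Illustrative data: only three distinct indexes, but one of them is large
idx = np.array([3, 1_000_000, 3, 42])
val = np.array([0.5, 2.0, 1.5, 4.0])

# Direct bincount: the output length is idx.max() + 1
dense = np.bincount(idx, weights=val)

# Inverse-based bincount: the output length is the number of unique indexes
unq, inv = np.unique(idx, return_inverse=True)
compact = np.bincount(inv, weights=val)

print(dense.size, compact.size)
```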