
Fast way for weighted counting of numpy arrays

I have two 2D numpy arrays with the same shape:

idx = np.array([[1, 2, 5, 6],[1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

I know that we can use np.bincount setting val as weights:

np.bincount(idx.reshape(-1), weights=val.reshape(-1))

But this is not exactly what I want. np.bincount puts zeros where an index does not occur: its output has length idx.max() + 1 = 7, so the missing indices 0 and 4 show up as zeros. In the example, the result is:

array([0. , 0.2, 0.7, 0. , 0. , 1.1, 0.2])

But I do not want these extra zeros for the missing indices. I want the weighted counts corresponding to np.unique(idx):

array([1, 2, 3, 5, 6])

And my expected results are:

array([0.2, 0.7, 0., 1.1, 0.2])

Does anyone have an idea how to do this efficiently? My idx and val are very large, with more than 1 million elements.

You can use the numpy library for this.

Check this out:

import numpy as np

output = []
for i in np.unique(idx):
    # sum the weights at every position where idx equals i
    output.append(float(np.sum(val[idx == i])))
print(output)
# [0.2, 0.7, 0.0, 1.1, 0.2]

This is reasonably fast when idx contains only a few distinct values. I hope it helps.

As you may know, Python-level for loops are not great for efficiency.

You can instead index the output of np.bincount with np.unique:

>>> np.bincount(idx.reshape(-1), val.reshape(-1))[np.unique(idx)]
array([0.2, 0.7, 0. , 1.1, 0.2])

If you just want to get rid of the zeros for the missing indices, this is probably the fastest way. Note that it keeps the genuine zero for index 3, which does occur in idx, while dropping the zeros for the absent indices 0 and 4.

The key to success is to:

  • map the unique values from idx to consecutive integers, starting from 0,
  • compute bincount on the result of the above mapping, instead of idx itself.

The code to do it (quite concise and without any loop) is:

import numpy as np
import pandas as pd

unq = np.unique(idx)
mapper = pd.Series(range(unq.size), index=unq)  # label -> code 0..unq.size-1
np.bincount(mapper[idx.reshape(-1)], weights=val.reshape(-1))

For your sample data, the result is:

array([0.2, 0.7, 0. , 1.1, 0.2])
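
If you would rather avoid the pandas dependency, the same label-to-code mapping can be built with np.searchsorted against the sorted unique values (a sketch using only numpy and the sample arrays from the question):

import numpy as np

unq = np.unique(idx)
# position of each element within the sorted unique labels, i.e. the
# same consecutive codes that the pandas mapper produces
codes = np.searchsorted(unq, idx.reshape(-1))
np.bincount(codes, weights=val.reshape(-1))
# array([0.2, 0.7, 0. , 1.1, 0.2])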

Method 1:

Use np.unique with return_inverse=True.

idx = np.array([[1, 2, 5, 6], [1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

unq, inv = np.unique(idx, return_inverse=True)
# inv.reshape(-1) keeps this working on NumPy >= 2.0, where inv
# comes back with the same shape as idx rather than flattened
np.bincount(inv.reshape(-1), val.reshape(-1))
# array([0.2, 0.7, 0. , 1.1, 0.2])
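
Because unq is sorted, its entries line up position-by-position with the weighted counts, which makes it easy to see which label each count belongs to. For instance, continuing from the snippet above:

counts = np.bincount(inv.reshape(-1), val.reshape(-1))
# column 0: label from unq, column 1: its weighted count
print(np.column_stack((unq, counts)))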

Method 2:

Use bincount twice, and use the unweighted counts to drop the bins for indices that never occur (genuine zero sums, like index 3's, are kept).

counts = np.bincount(idx.reshape(-1))  # occurrences per index
np.bincount(idx.reshape(-1), val.reshape(-1))[counts.nonzero()]
# array([0.2, 0.7, 0. , 1.1, 0.2])

Which is better will depend on how spread out the values in idx are; see the benchmark sketched below.
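
A quick micro-benchmark along these lines can tell you which one wins for data shaped like yours (a minimal sketch: the array shape, the 0–10000 label range, and the seed are arbitrary assumptions standing in for your real data):

import numpy as np
from timeit import timeit

# synthetic stand-in data; shape and label range are assumptions
rng = np.random.default_rng(0)
idx = rng.integers(0, 10_000, size=(1_000, 1_200))
val = rng.random(idx.shape)

flat_idx = idx.reshape(-1)
flat_val = val.reshape(-1)

def method1():
    # unique labels + inverse codes, then one bincount
    unq, inv = np.unique(idx, return_inverse=True)
    return np.bincount(inv.reshape(-1), flat_val)

def method2():
    # two bincounts, then keep only the occupied bins
    return np.bincount(flat_idx, flat_val)[np.bincount(flat_idx).nonzero()]

print("method 1:", timeit(method1, number=10))
print("method 2:", timeit(method2, number=10))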
