
Fast way for weighted counting of numpy arrays

I have two 2D numpy arrays with the same shape:

idx = np.array([[1, 2, 5, 6],[1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

I know that we can use np.bincount setting val as weights:

np.bincount(idx.reshape(-1), weights=val.reshape(-1))

But this is not exactly what I want. np.bincount puts zeros where an index does not occur: its output has length idx.max() + 1 = 7, so the missing indices 0 and 4 show up as zeros. In the example, the result is:

array([0. , 0.2, 0.7, 0. , 0. , 1.1, 0.2])

But I do not want these extra zeros for the missing indices. I want the weighted counts corresponding to np.unique(idx):

array([1, 2, 3, 5, 6])

And my expected results are:

array([0.2, 0.7, 0., 1.1, 0.2])

Does anyone have an idea how to do this efficiently? My idx and val are very large, with more than 1 million elements.

You can use the numpy library for this.

Check this out:

import numpy as np

output = []
for i in np.unique(idx):
    # sum the weights at every position where idx equals i
    output.append(float(np.sum(val[idx == i])))
print(output)
# [0.2, 0.7, 0.0, 1.1, 0.2]

This is reasonably fast when idx contains only a few distinct values. I hope it helps.

As you may know, Python-level for loops are not great for efficiency.

You can instead index the output of np.bincount with np.unique:

>>> np.bincount(idx.reshape(-1), val.reshape(-1))[np.unique(idx)]
array([0.2, 0.7, 0. , 1.1, 0.2])

If you just want to get rid of the zeros for the missing indices, this is probably the fastest way. Note that it keeps the genuine zero for index 3, which does occur in idx, while dropping the zeros for the absent indices 0 and 4.

The key to success is to:

  • map the unique values from idx to consecutive integers, starting from 0,
  • compute bincount on the result of the above mapping, instead of idx itself.

The code to do it (quite concise and without any loop) is:

import numpy as np
import pandas as pd

unq = np.unique(idx)
mapper = pd.Series(range(unq.size), index=unq)  # label -> code 0..unq.size-1
np.bincount(mapper[idx.reshape(-1)], weights=val.reshape(-1))

For your sample data, the result is:

array([0.2, 0.7, 0. , 1.1, 0.2])
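
If you would rather avoid the pandas dependency, the same label-to-code mapping can be built with np.searchsorted against the sorted unique values (a sketch using only numpy and the sample arrays from the question):

import numpy as np

unq = np.unique(idx)
# position of each element within the sorted unique labels, i.e. the
# same consecutive codes that the pandas mapper produces
codes = np.searchsorted(unq, idx.reshape(-1))
np.bincount(codes, weights=val.reshape(-1))
# array([0.2, 0.7, 0. , 1.1, 0.2])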

Method 1:

Use np.unique with return_inverse=True.

idx = np.array([[1, 2, 5, 6], [1, 3, 5, 2]])
val = np.array([[0.1, 0.5, 0.3, 0.2], [0.1, 0., 0.8, 0.2]])

unq, inv = np.unique(idx, return_inverse=True)
# inv.reshape(-1) keeps this working on NumPy >= 2.0, where inv
# comes back with the same shape as idx rather than flattened
np.bincount(inv.reshape(-1), val.reshape(-1))
# array([0.2, 0.7, 0. , 1.1, 0.2])
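
Because unq is sorted, its entries line up position-by-position with the weighted counts, which makes it easy to see which label each count belongs to. For instance, continuing from the snippet above:

counts = np.bincount(inv.reshape(-1), val.reshape(-1))
# column 0: label from unq, column 1: its weighted count
print(np.column_stack((unq, counts)))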

Method 2:

Use bincount twice, and use the unweighted counts to drop the bins for indices that never occur (genuine zero sums, like index 3's, are kept).

counts = np.bincount(idx.reshape(-1))  # occurrences per index
np.bincount(idx.reshape(-1), val.reshape(-1))[counts.nonzero()]
# array([0.2, 0.7, 0. , 1.1, 0.2])

Which is better will depend on how spread out the values in idx are; see the benchmark sketched below.
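
A quick micro-benchmark along these lines can tell you which one wins for data shaped like yours (a minimal sketch: the array shape, the 0–10000 label range, and the seed are arbitrary assumptions standing in for your real data):

import numpy as np
from timeit import timeit

# synthetic stand-in data; shape and label range are assumptions
rng = np.random.default_rng(0)
idx = rng.integers(0, 10_000, size=(1_000, 1_200))
val = rng.random(idx.shape)

flat_idx = idx.reshape(-1)
flat_val = val.reshape(-1)

def method1():
    # unique labels + inverse codes, then one bincount
    unq, inv = np.unique(idx, return_inverse=True)
    return np.bincount(inv.reshape(-1), flat_val)

def method2():
    # two bincounts, then keep only the occupied bins
    return np.bincount(flat_idx, flat_val)[np.bincount(flat_idx).nonzero()]

print("method 1:", timeit(method1, number=10))
print("method 2:", timeit(method2, number=10))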
