
Improving performance by avoiding a Python for loop when using the counts from numpy unique

I have two numpy arrays, A with shape (N, 3) and B with shape (N,), and from A I generate the array of unique rows, e.g.:

import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [1., 2., 3.],
              [7., 8., 9.]])

B = np.array([10.,33.,15.,17.])

AUnique, directInd, inverseInd, counts = np.unique(A,
                                                   return_index=True,
                                                   return_inverse=True,
                                                   return_counts=True,
                                                   axis=0)

So that AUnique will be array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]]).
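For reference, the other arrays returned for this toy input are (values computed from the example above):

directInd   # array([0, 1, 3]) - index in A of each unique row's first occurrence
inverseInd  # array([0, 1, 0, 2]) - for each row of A, its row index in AUnique
counts      # array([2, 1, 1]) - number of occurrences of each unique row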

Then I build the corresponding B vector for AUnique, and for each row of A that appears more than once I sum the associated values of B into this vector, that is:

BNew = B[directInd]

# here BNew is [10., 33., 17.]

# for each unique row that occurs more than once in A, replace its entry
# with the sum of B over all duplicates of that row
for Id in np.asarray(counts > 1).nonzero()[0]:
    BNew[Id] = np.sum(B[inverseInd == Id])

# here BNew is [25., 33., 17.]

The problem is that the for loop gets extremely slow for large arrays (millions or tens of millions of rows), and I was wondering whether there is a way to avoid the loop and/or make the code much faster.

Thanks in advance!

I think you can do what you want with np.bincount:

BNew = np.bincount(inverseInd, weights=B)
BNew

Out[]: array([25., 33., 17.])
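For completeness, here is a minimal end-to-end sketch of this approach on the toy arrays from the question. The only thing added beyond the answer is the np.ravel call, which is a defensive assumption to guard against NumPy versions where np.unique(..., axis=0) returns the inverse indices with an extra dimension (np.bincount requires a 1-D index array):

import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [1., 2., 3.],
              [7., 8., 9.]])
B = np.array([10., 33., 15., 17.])

# Unique rows of A plus, for every row of A, the index of its unique row.
AUnique, inverseInd = np.unique(A, return_inverse=True, axis=0)

# bincount treats inverseInd as group labels: entry k of the result is the
# sum of the weights B[i] over all i with inverseInd[i] == k, which is
# exactly what the Python loop computed, in a single C-level pass.
BNew = np.bincount(np.ravel(inverseInd), weights=B)

print(AUnique)  # [[1. 2. 3.] [4. 5. 6.] [7. 8. 9.]]
print(BNew)     # [25. 33. 17.]

If you ever need a variant that writes into a preallocated array (for example to keep a specific dtype), np.add.at(BNew, inverseInd, B) on a zero-initialized BNew computes the same per-group sums, though np.bincount is usually the faster of the two.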
