Using numpy.bincount with array weights

I would like to use bincount to sum arrays; however, it only supports 1-D weights (one scalar per entry). For example, this works:

np.bincount([1, 1, 0], weights=np.array([1, 2, 4]))
Out: array([ 4.,  3.])

However, I would like to use a 2-D array as the weights:

np.bincount([1, 1, 0], weights=np.array([[1, 1], [2, 2], [4, 4]]))
ValueError: object too deep for desired array

The desired output is:

Out: array([[ 4.,  4.], [ 3.,  3.]])

Better explanation after comments:

I want to sum each row of the array into the corresponding index.

With a loop it would be:

Bin = np.zeros((2, 2))
for j, i in enumerate([1, 1, 0]):
    Bin[i] += a[j]

Here a is the previous 3x2 matrix. Is there an efficient way to obtain this result?
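For reference, the loop above can be vectorized directly with NumPy's np.add.at, whose unbuffered scatter-add accumulates over repeated indices (a minimal sketch, not part of the original question):

import numpy as np

a = np.array([[1, 1], [2, 2], [4, 4]])

Bin = np.zeros((2, 2))
np.add.at(Bin, [1, 1, 0], a)  # repeated indices accumulate instead of overwriting
# Bin == array([[4., 4.],
#               [3., 3.]])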

As per the numpy documentation:

numpy.bincount(x, weights=None, minlength=None)

weights : array_like, optional; Weights, array of the same shape as x.

So you can't use bincount directly in this fashion unless you alter x somehow.
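To illustrate what "altering x" can look like (a sketch, not part of the original answer): give every (bin, column) pair its own flat bin index, call bincount once on the flattened weights, and reshape the result:

import numpy as np

ii = np.array([1, 1, 0])
a = np.array([[1, 1], [2, 2], [4, 4]])

n_bins, n_cols = ii.max() + 1, a.shape[1]
# entry (ii[k], j) of the result lives at flat index ii[k] * n_cols + j
x = (ii[:, None] * n_cols + np.arange(n_cols)).ravel()
out = np.bincount(x, weights=a.ravel(), minlength=n_bins * n_cols)
print(out.reshape(n_bins, n_cols))
# [[4. 4.]
#  [3. 3.]]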

Edit: So I came up with a slightly tricky way of doing this, but no guarantees about the performance when you go to large arrays. Basically I'm going to leverage how scipy sparse matrices handle repeated entries at the same indices (they sum them):

import numpy as np
from scipy.sparse import coo_matrix

a = np.array([[1, 1], [2, 2], [4, 4]])
ii = np.array([1, 1, 0])

ares = a.reshape((-1,), order='F')
# ares == array([1, 2, 4, 1, 2, 4])

col = np.tile(ii, (a.shape[1],))
# col == np.array([1, 1, 0, 1, 1, 0])

row = np.tile([0, 1], (a.shape[0], 1)).reshape((-1,), order='F')
# row == np.array([0, 0, 0, 1, 1, 1])

# duplicate (col, row) coordinates are summed when the matrix is built
g = coo_matrix((ares, (col, row)), shape=(2, 2))
print(g.todense())

Now you're going to have to generalize this to your precise data. The basic idea is that you want to map each data point to the correct element of your results array and then let the sparse array handle summing the duplicate entries.
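One possible generalization along those lines (my sketch, same idea as the snippet above) wraps the trick in a function that accepts any (n, d) weights array:

import numpy as np
from scipy.sparse import coo_matrix

def bincount_2d(ii, a):
    # sum the rows of a into bins given by ii; result shape (ii.max() + 1, a.shape[1])
    n, d = a.shape
    data = a.reshape(-1, order='F')    # column-major flatten of the weights
    rows = np.tile(ii, d)              # bin index of every flattened entry
    cols = np.repeat(np.arange(d), n)  # column index of every flattened entry
    return coo_matrix((data, (rows, cols)), shape=(ii.max() + 1, d)).toarray()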

Otherwise, I would look into using Cython if you are forced to use looping to solve this.

Edit 2: For kicks, I timed two different methods:

import numpy as np
from scipy.sparse import coo_matrix

def method1():
    # one bincount per column, stacked back into a 2-D result
    return np.array([np.bincount(ii, r) for r in a.T]).T

def method2():
    # the COO trick from above, generalized to a.shape[1] columns
    ares = a.reshape((-1,), order='F')
    col = np.tile(ii, (a.shape[1],))
    row = np.tile(np.arange(a.shape[1]), (a.shape[0], 1)).reshape((-1,), order='F')

    return coo_matrix((ares, (col, row)), shape=(np.unique(ii).size, a.shape[1])).todense()

if __name__ == '__main__':
    from timeit import Timer

    a = np.random.randint(0, 1000, (1000000, 3))
    ii = np.random.randint(0, 10, (a.shape[0],))

    N = 100
    t1 = Timer("method1()", "from __main__ import method1")
    t2 = Timer("method2()", "from __main__ import method2")
    print('t2/t1: %f' % (t2.timeit(N) / t1.timeit(N)))

On my machine, method2 is about 3-5x slower than method1, depending on the shape of the inputs, so looping isn't necessarily a bad option.

I think this is a simpler version of what you asked for, based on bincount: in essence, you split the weights by column so that each bincount call sees a 1-D weights array of the same length as your index array, and then stack the per-column results back together.

a = np.array([1, 1, 0])
b = np.array([[1, 1], [2, 2], [4, 4]])
uni = np.unique(a)

a_x = np.bincount(a, weights=b[:, 0], minlength=len(uni))
a_y = np.bincount(a, weights=b[:, 1], minlength=len(uni))

final = np.column_stack((a_x, a_y))
# final == np.array([[ 4.,  4.], [ 3.,  3.]])
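If b has more than two columns, the same idea extends with a comprehension (my generalization of the snippet above):

final = np.column_stack([np.bincount(a, weights=b[:, j], minlength=len(uni))
                         for j in range(b.shape[1])])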

You should use a scipy CSR matrix to represent the indices, and then take a dot product with your data. On my laptop it's 14x faster than @JoshAdel's method1 and 54x faster than his method2 for big matrices.

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def method1():
    return np.array([np.bincount(ii, r) for r in a.T]).T

def method2():
    ares = a.reshape((-1,), order='F')
    col = np.tile(ii, (a.shape[1],))
    row = np.tile(np.arange(a.shape[1]), (a.shape[0], 1)).reshape((-1,), order='F')

    return coo_matrix((ares, (col, row)), shape=(ii.max() + 1, a.shape[1])).todense()

def method3():
    # indicator matrix: row i has a 1 in column k wherever ii[k] == i
    csr = csr_matrix((np.ones(ii.shape[0]), (ii, np.arange(ii.shape[0]))))
    return csr * a
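As a quick sanity check (not in the original answer), here is method3 on the question's small 3x2 example:

import numpy as np
from scipy.sparse import csr_matrix

ii = np.array([1, 1, 0])
a = np.array([[1, 1], [2, 2], [4, 4]])

# row i of the CSR matrix selects the rows of a whose bin is i
csr = csr_matrix((np.ones(ii.shape[0]), (ii, np.arange(ii.shape[0]))))
print(csr * a)
# [[4. 4.]
#  [3. 3.]]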

Let's generate random data and time them:

n = 1 << 18
d = 512
ii = np.random.randint(low=1, high=1 << 10, size=n)
a = np.random.randn(n, d)

%timeit method1()
# 1 loop, best of 3: 3.13 s per loop

%timeit method2()
# 1 loop, best of 3: 11.7 s per loop

%timeit method3()
# 1 loop, best of 3: 216 ms per loop

# sanity checks:
assert (method1() == method2()).all()
assert (method1() == method3()).all()

Note: I replaced np.unique(ii).size in method2 with ii.max() + 1. bincount always produces ii.max() + 1 bins, whereas np.unique(ii).size undercounts when some bins are empty (here bin 0, since ii is drawn from [1, 1024)), which would make the shapes disagree.
