
Applying a function by bins on a vector in Numpy

How would I go about applying an aggregating function (such as sum() or max()) to the bins of a vector?

That is if I have:

  1. a vector of values x of length N
  2. a vector of bin tags b of length N

such that b indicates which bin each value in x belongs to. For every possible value in b, I want to apply the aggregating function func() to all the values of x that belong to that bin.

>> x = [1,2,3,4,5,6]
>> b = ["a","b","a","a","c","c"]    

The output should be two vectors (say the aggregating function is the product function):

>>(labels, y) = apply_to_bins(values = x, bins = b, func = prod)

labels = ["a","b","c"]
y = [12, 2, 30]

I want to do this as elegantly as possible in numpy (or plain Python), since obviously I could just write a for loop over it.

import itertools as it
import operator as op

def apply_to_bins(values, bins, func):
    pairs = sorted(zip(bins, values), key=op.itemgetter(0))
    return {k: func(pair[1] for pair in v)
            for k, v in it.groupby(pairs, key=op.itemgetter(0))}

x = [1,2,3,4,5,6]
b = ["a","b","a","a","c","c"]   

print apply_to_bins(x, b, sum) # returns {'a': 8, 'b': 2, 'c': 11}
print apply_to_bins(x, b, max) # returns {'a': 4, 'b': 2, 'c': 6}
>>> import numpy as np
>>> from itertools import groupby
>>> x = np.array([1, 2, 3, 4, 5, 6])
>>> b = ["a", "b", "a", "a", "c", "c"]
>>> zip(*[(k, np.product(x[list(v)]))
...       for k, v in groupby(np.argsort(b), key=lambda i: b[i])])
[('a', 'b', 'c'), (12, 2, 30)]

Or, step by step:

>>> np.argsort(b)
array([0, 2, 3, 1, 4, 5])

A list of indices into b (or x), in sorted order by the keys in b.

>>> [(k, list(v)) for k, v in groupby(np.argsort(b), key=lambda i: b[i])]
[('a', [0, 2, 3]), ('b', [1]), ('c', [4, 5])]

The indices grouped by key from b.

>>> [(k, x[list(v)]) for k, v in groupby(np.argsort(b), key=lambda i: b[i])]
[('a', array([1, 3, 4])), ('b', array([2])), ('c', array([5, 6]))]

Use the indices to get the right elements from x.

>>> [(k, np.product(x[list(v)]))
...  for k, v in groupby(np.argsort(b), key=lambda i: b[i])]
[('a', 12), ('b', 2), ('c', 30)]

Apply np.product .

So, putting everything together,

def apply_to_bins(values, bins, op):
    grouped = groupby(np.argsort(bins), key=lambda i: bins[i])
    applied = [(bin, op(values[list(indices)])) for bin, indices in grouped]
    return zip(*applied)
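For example, calling it on the question's data (with values as a numpy array, since the function indexes into it with a list of positions) should reproduce the output asked for in the question:

>>> x = np.array([1, 2, 3, 4, 5, 6])
>>> b = np.array(["a", "b", "a", "a", "c", "c"])
>>> labels, y = apply_to_bins(x, b, np.prod)
>>> labels
('a', 'b', 'c')
>>> y
(12, 2, 30)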

If you are going to be doing this sort of thing, I would strongly suggest using the pandas package. It has a nice groupby() method that you can call on a DataFrame or Series, which makes this sort of thing easy.

Example:


In [450]: lst = [1, 2, 3, 1, 2, 3]
In [451]: s = Series([1, 2, 3, 10, 20, 30], lst)
In [452]: grouped = s.groupby(level=0)
In [455]: grouped.sum()
Out[455]: 
1    11
2    22
3    33
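
Other aggregations work the same way on the grouped object, e.g. max() or agg() with an arbitrary function. Continuing the same session (Series is assumed to have been imported from pandas, and the exact Out formatting can differ between pandas versions):

In [456]: grouped.max()
Out[456]: 
1    10
2    20
3    30

In [457]: grouped.agg(np.prod)
Out[457]: 
1    10
2    40
3    90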

There are a couple of interesting solutions that don't depend on groupby. The first is really simple:

def apply_to_bins(func, values, bins):
    return zip(*((bin, func(values[bins == bin])) for bin in set(bins)))

This uses "fancy indexing" instead of grouping, and performs reasonably well for small inputs; a list-comprehension-based variation does a bit better (see below for timings).

def apply_to_bins2(func, values, bins):
    bin_names = sorted(set(bins))
    return bin_names, [func(values[bins == bin]) for bin in bin_names]
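
Both variants rely on the same masking step. To make that step concrete, here is what bins == bin evaluates to for one bin of the question's data (assuming values and bins are numpy arrays; the exact repr depends on the numpy version):

>>> import numpy
>>> values = numpy.array([1, 2, 3, 4, 5, 6])
>>> bins = numpy.array(["a", "b", "a", "a", "c", "c"])
>>> bins == "a"
array([ True, False,  True,  True, False, False], dtype=bool)
>>> values[bins == "a"]
array([1, 3, 4])
>>> numpy.prod(values[bins == "a"])
12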

These have the advantage of being pretty readable. Both also fare better than groupby for small inputs, but they get much slower for large inputs, especially when there are many bins; their performance is O(n_items * n_bins). A different numpy-based approach is slower for small inputs, but much faster for large inputs, and especially so for large inputs with lots of bins:

import numpy

def apply_to_bins3(func, values, bins):
    # sort values and bins together by bin label
    bins_argsort = bins.argsort()
    values = values[bins_argsort]
    bins = bins[bins_argsort]
    # positions where the bin label changes in the sorted order
    group_indices = (bins[1:] != bins[:-1]).nonzero()[0] + 1
    groups = numpy.split(values, group_indices)
    return numpy.unique(bins), [func(g) for g in groups]
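
To see what the sort-and-split trick is doing, here is the same computation unrolled on the question's data (a small illustrative session; the string dtype shown is what an older, Python 2 numpy prints, and the order of ties within a bin can vary with the sort algorithm, but the grouping is the same):

>>> values = numpy.array([1, 2, 3, 4, 5, 6])
>>> bins = numpy.array(["a", "b", "a", "a", "c", "c"])
>>> order = bins.argsort()
>>> bins[order]
array(['a', 'a', 'a', 'b', 'c', 'c'], dtype='|S1')
>>> values[order]
array([1, 3, 4, 2, 5, 6])
>>> (bins[order][1:] != bins[order][:-1]).nonzero()[0] + 1
array([3, 4])
>>> numpy.split(values[order], [3, 4])
[array([1, 3, 4]), array([2]), array([5, 6])]

The boundaries [3, 4] mark where a new bin starts in the sorted order, so numpy.split cuts the sorted values into one chunk per bin, and func only has to be applied once per chunk.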

Some tests. First for small inputs:

>>> def apply_to_bins_groupby(func, x, b):
...         return zip(*[(k, np.product(x[list(v)]))
...                  for k, v in groupby(np.argsort(b), key=lambda i: b[i])])
... 
>>> x = numpy.array([1, 2, 3, 4, 5, 6])
>>> b = numpy.array(['a', 'b', 'a', 'a', 'c', 'c'])
>>> 
>>> %timeit apply_to_bins(numpy.prod, x, b)
10000 loops, best of 3: 31.9 us per loop
>>> %timeit apply_to_bins2(numpy.prod, x, b)
10000 loops, best of 3: 29.6 us per loop
>>> %timeit apply_to_bins3(numpy.prod, x, b)
10000 loops, best of 3: 122 us per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
10000 loops, best of 3: 67.9 us per loop

apply_to_bins3 doesn't fare too well here, but it's still less than an order of magnitude slower than the fastest. It does better when n_items gets larger:

>>> x = numpy.arange(1, 100000)
>>> b_names = numpy.array(['a', 'b', 'c', 'd'])
>>> b = b_names[numpy.random.random_integers(0, 3, 99999)]
>>> 
>>> %timeit apply_to_bins(numpy.prod, x, b)
10 loops, best of 3: 27.8 ms per loop
>>> %timeit apply_to_bins2(numpy.prod, x, b)
10 loops, best of 3: 27 ms per loop
>>> %timeit apply_to_bins3(numpy.prod, x, b)
100 loops, best of 3: 13.7 ms per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
10 loops, best of 3: 124 ms per loop

And when n_bins goes up, the first two approaches take too long to bother showing here -- around five seconds. apply_to_bins3 is the clear winner here.

>>> x = numpy.arange(1, 100000)
>>> from itertools import product
>>> bn_product = product(['a', 'b', 'c', 'd', 'e'], repeat=5)
>>> b_names = numpy.array(list(''.join(s) for s in bn_product))
>>> b = b_names[numpy.random.random_integers(0, len(b_names) - 1, 99999)]
>>> 
>>> %timeit apply_to_bins3(numpy.prod, x, b)
10 loops, best of 3: 109 ms per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
1 loops, best of 3: 205 ms per loop

Overall, groupby is probably fine in most cases, but is unlikely to scale well, as suggested by this thread. Using a pure(er) numpy approach is slower for small inputs, but only by a bit; the tradeoff is a good one.

With pandas groupby this would be

import pandas as pd

def with_pandas_groupby(func, x, b):
    grouped = pd.Series(x).groupby(b)
    return grouped.agg(func)

Using the example of the OP:

>>> x = [1,2,3,4,5,6]
>>> b = ["a","b","a","a","c","c"]
>>> with_pandas_groupby(np.prod, x, b)
a    12
b     2
c    30
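
The result is a pandas Series indexed by the bin labels, so if you want the two separate vectors from the question you can pull them out of it, for example (exact dtypes and repr may vary with the pandas version):

>>> result = with_pandas_groupby(np.prod, x, b)
>>> labels, y = result.index.values, result.values
>>> labels
array(['a', 'b', 'c'], dtype=object)
>>> y
array([12,  2, 30])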

I was just interested in the speed, so I compared with_pandas_groupby with some of the functions given in senderle's answer.

  • apply_to_bins_groupby

       3 levels,     100 values: 175 us per loop
       3 levels,    1000 values: 1.16 ms per loop
       3 levels, 1000000 values: 1.21 s per loop
      10 levels,     100 values: 304 us per loop
      10 levels,    1000 values: 1.32 ms per loop
      10 levels, 1000000 values: 1.23 s per loop
      26 levels,     100 values: 554 us per loop
      26 levels,    1000 values: 1.59 ms per loop
      26 levels, 1000000 values: 1.27 s per loop

  • apply_to_bins3

       3 levels,     100 values: 136 us per loop
       3 levels,    1000 values: 259 us per loop
       3 levels, 1000000 values: 205 ms per loop
      10 levels,     100 values: 297 us per loop
      10 levels,    1000 values: 447 us per loop
      10 levels, 1000000 values: 262 ms per loop
      26 levels,     100 values: 617 us per loop
      26 levels,    1000 values: 795 us per loop
      26 levels, 1000000 values: 299 ms per loop

  • with_pandas_groupby

       3 levels,     100 values: 365 us per loop
       3 levels,    1000 values: 443 us per loop
       3 levels, 1000000 values: 89.4 ms per loop
      10 levels,     100 values: 369 us per loop
      10 levels,    1000 values: 453 us per loop
      10 levels, 1000000 values: 88.8 ms per loop
      26 levels,     100 values: 382 us per loop
      26 levels,    1000 values: 466 us per loop
      26 levels, 1000000 values: 89.9 ms per loop

So pandas is the fastest for large numbers of items. Furthermore, the number of levels (bins) has no big influence on the computation time. (Note that the timings start from numpy arrays, so the time needed to create the pandas.Series is included.)

I generated the data with:

def gen_data(nlevels, size):
    choices = 'abcdefghijklmnopqrstuvwxyz'
    levels = np.asarray([l for l in choices[:nlevels]])
    index = np.random.random_integers(0, levels.size - 1, size)
    b = levels[index]
    x = np.arange(1, size + 1)
    return x, b

And then run the benchmark in ipython like this:

In [174]: for nlevels in (3, 10, 26):
   .....:     for size in (100, 1000, 10e5):
   .....:         x, b = gen_data(nlevels, size)
   .....:         print '%2d levels, ' % nlevels, '%7d values:' % size,
   .....:         %timeit function_to_time(np.prod, x, b)
   .....:     print

In the special case where the aggregation function func can be expressed as a sum, bincount seems faster than pandas. For example, when func is the product, it can be expressed as a sum of logarithms and we can do:

import numpy as np
import numpy.random as nr

x = np.arange( 1, 1000001 )        # start at 1 so that np.log( values ) stays finite
b = nr.randint( 0, 100, 1000000 )

def apply_to_bincount( values, bins ) :
    logy = np.bincount( bins, weights=np.log( values ) )
    return np.arange(len(logy)), np.exp( logy )

%%timeit
apply_to_bincount( x, b )
10 loops, best of 3: 16.9 ms per loop

%%timeit
with_pandas_groupby( np.prod, x, b )
10 loops, best of 3: 36.2 ms per loop
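
Note that np.bincount only accepts non-negative integer bin labels, so for string bins like the ones in the question you first have to encode them as integer codes; np.unique with return_inverse=True does that and also gives back the label vector. A minimal sketch of that combination (the function name here is made up, and values must be strictly positive for the log trick to work):

def apply_to_bincount_labeled( values, bins ) :
    # encode arbitrary bin labels as integer codes 0 .. n_bins-1
    labels, codes = np.unique( bins, return_inverse=True )
    # product per bin, expressed as exp(sum of logs); requires values > 0
    logy = np.bincount( codes, weights=np.log( values ) )
    return labels, np.exp( logy )

x = np.array( [1., 2., 3., 4., 5., 6.] )
b = np.array( ["a", "b", "a", "a", "c", "c"] )
print apply_to_bincount_labeled( x, b )
# labels -> ['a' 'b' 'c'], products -> approximately [ 12.   2.  30.] (up to floating point error)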
