
Applying a function by bins on a vector in Numpy

How would I go about applying an aggregating function (such as "sum()" or "max()") to bins in a vector?

That is, if I have:

  1. a vector of values x of length N
  2. a vector of bin tags b of length N

such that b indicates which bin each value in x belongs to. For every possible value in b, I want to apply the aggregating function "func()" to all the values of x that belong to that bin.

>> x = [1,2,3,4,5,6]
>> b = ["a","b","a","a","c","c"]    

the output should be two vectors (say the aggregating function is the product function):

>>(labels, y) = apply_to_bins(values = x, bins = b, func = prod)

labels = ["a","b","c"]
y = [12, 2, 30]

I want to do this as elegantly as possible in numpy (or just plain Python), since obviously I could just write a for loop over it.
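For concreteness, the obvious for-loop baseline (a hypothetical apply_to_bins_loop, shown only to pin down the expected behavior, not part of any answer below) would be:

```python
def apply_to_bins_loop(values, bins, func):
    # Collect the values belonging to each bin tag.
    grouped = {}
    for v, tag in zip(values, bins):
        grouped.setdefault(tag, []).append(v)
    # Apply the aggregating function per bin, labels in sorted order.
    labels = sorted(grouped)
    return labels, [func(grouped[k]) for k in labels]

labels, y = apply_to_bins_loop([1, 2, 3, 4, 5, 6],
                               ["a", "b", "a", "a", "c", "c"], sum)
# labels -> ['a', 'b', 'c'];  y -> [8, 2, 11]
```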

import itertools as it
import operator as op

def apply_to_bins(values, bins, func):
    pairs = sorted(zip(bins, values), key=op.itemgetter(0))
    return {k: func(x[1] for x in v)
            for k, v in it.groupby(pairs, key=op.itemgetter(0))}

x = [1,2,3,4,5,6]
b = ["a","b","a","a","c","c"]   

print apply_to_bins(x, b, sum) # prints {'a': 8, 'b': 2, 'c': 11}
print apply_to_bins(x, b, max) # prints {'a': 4, 'b': 2, 'c': 6}
>>> import numpy as np
>>> from itertools import groupby
>>> x = np.array([1, 2, 3, 4, 5, 6])
>>> b = ["a", "b", "a", "a", "c", "c"]
>>> zip(*[(k, np.product(x[list(v)]))
...       for k, v in groupby(np.argsort(b), key=lambda i: b[i])])
[('a', 'b', 'c'), (12, 2, 30)]

Or, step by step:

>>> np.argsort(b)
array([0, 2, 3, 1, 4, 5])

A list of indices into b (or x) in sorted order by the keys in b.

>>> [(k, list(v)) for k, v in groupby(np.argsort(b), key=lambda i: b[i])]
[('a', [0, 2, 3]), ('b', [1]), ('c', [4, 5])]

The indices grouped by key from b.

>>> [(k, x[list(v)]) for k, v in groupby(np.argsort(b), key=lambda i: b[i])]
[('a', array([1, 3, 4])), ('b', array([2])), ('c', array([5, 6]))]

Use the indices to get the right elements from x.

>>> [(k, np.product(x[list(v)]))
...  for k, v in groupby(np.argsort(b), key=lambda i: b[i])]
[('a', 12), ('b', 2), ('c', 30)]

Apply np.product.

So, putting everything together:

def apply_to_bins(values, bins, op):
    grouped = groupby(np.argsort(bins), key=lambda i: bins[i])
    applied = [(bin, op(values[list(indices)])) for bin, indices in grouped]
    return zip(*applied)

If you are going to be doing this sort of thing, I would strongly suggest using the Pandas package. It has a nice groupby() method that you can call on a DataFrame or Series, which makes this sort of thing easy.

Example:


In [449]: from pandas import Series
In [450]: lst = [1, 2, 3, 1, 2, 3]
In [451]: s = Series([1, 2, 3, 10, 20, 30], lst)
In [452]: grouped = s.groupby(level=0)
In [455]: grouped.sum()
Out[455]: 
1    11
2    22
3    33

There are a couple of interesting solutions that don't depend on groupby. The first is really simple:

def apply_to_bins(func, values, bins):
    return zip(*((bin, func(values[bins == bin])) for bin in set(bins)))

This uses "fancy indexing" instead of grouping, and performs reasonably well for small inputs; a list-comprehension-based variation does a bit better (see below for timings).

def apply_to_bins2(func, values, bins):
    bin_names = sorted(set(bins))
    return bin_names, [func(values[bins == bin]) for bin in bin_names]
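Applied to the question's data (the function is restated here so the snippet runs standalone; note that bins must be a NumPy array for the bins == bin boolean mask to work):

```python
import numpy as np

def apply_to_bins2(func, values, bins):
    # For each distinct bin label, select matching values with a boolean mask.
    bin_names = sorted(set(bins))
    return bin_names, [func(values[bins == bin]) for bin in bin_names]

x = np.array([1, 2, 3, 4, 5, 6])
b = np.array(["a", "b", "a", "a", "c", "c"])
labels, y = apply_to_bins2(np.prod, x, b)
# labels -> ['a', 'b', 'c'];  y -> [12, 2, 30]
```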

These have the advantage of being pretty readable. Both also fare better than groupby for small inputs, but they get much slower for large inputs, especially when there are many bins; their performance is O(n_items * n_bins). A different numpy-based approach is slower for small inputs, but much faster for large inputs, especially for large inputs with lots of bins:

def apply_to_bins3(func, values, bins):
    bins_argsort = bins.argsort()
    values = values[bins_argsort]
    bins = bins[bins_argsort]
    group_indices = (bins[1:] != bins[:-1]).nonzero()[0] + 1
    groups = numpy.split(values, group_indices)
    return numpy.unique(bins), [func(g) for g in groups]
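The key step above is finding the boundaries where the sorted bin labels change; a minimal illustration of that trick on the question's data (using a stable sort so that tied labels keep their original order):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
b = np.array(["a", "b", "a", "a", "c", "c"])

order = b.argsort(kind="stable")   # indices that sort the labels
sorted_bins = b[order]             # ['a' 'a' 'a' 'b' 'c' 'c']
sorted_vals = x[order]             # [1 3 4 2 5 6]
# A group boundary sits wherever a label differs from its predecessor.
boundaries = (sorted_bins[1:] != sorted_bins[:-1]).nonzero()[0] + 1
groups = np.split(sorted_vals, boundaries)
# groups -> [array([1, 3, 4]), array([2]), array([5, 6])]
```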

Some tests. First for small inputs:

>>> def apply_to_bins_groupby(func, x, b):
...         return zip(*[(k, np.product(x[list(v)]))
...                  for k, v in groupby(np.argsort(b), key=lambda i: b[i])])
... 
>>> x = numpy.array([1, 2, 3, 4, 5, 6])
>>> b = numpy.array(['a', 'b', 'a', 'a', 'c', 'c'])
>>> 
>>> %timeit apply_to_bins(numpy.prod, x, b)
10000 loops, best of 3: 31.9 us per loop
>>> %timeit apply_to_bins2(numpy.prod, x, b)
10000 loops, best of 3: 29.6 us per loop
>>> %timeit apply_to_bins3(numpy.prod, x, b)
10000 loops, best of 3: 122 us per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
10000 loops, best of 3: 67.9 us per loop

The apply_to_bins3 doesn't fare too well here, but it's still less than an order of magnitude slower than the fastest. It does better when n_items gets larger:

>>> x = numpy.arange(1, 100000)
>>> b_names = numpy.array(['a', 'b', 'c', 'd'])
>>> b = b_names[numpy.random.random_integers(0, 3, 99999)]
>>> 
>>> %timeit apply_to_bins(numpy.prod, x, b)
10 loops, best of 3: 27.8 ms per loop
>>> %timeit apply_to_bins2(numpy.prod, x, b)
10 loops, best of 3: 27 ms per loop
>>> %timeit apply_to_bins3(numpy.prod, x, b)
100 loops, best of 3: 13.7 ms per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
10 loops, best of 3: 124 ms per loop

And when n_bins goes up, the first two approaches take too long to be worth showing here (around five seconds). apply_to_bins3 is the clear winner.

>>> from itertools import product
>>> x = numpy.arange(1, 100000)
>>> bn_product = product(['a', 'b', 'c', 'd', 'e'], repeat=5)
>>> b_names = numpy.array(list(''.join(s) for s in bn_product))
>>> b = b_names[numpy.random.random_integers(0, len(b_names) - 1, 99999)]
>>> 
>>> %timeit apply_to_bins3(numpy.prod, x, b)
10 loops, best of 3: 109 ms per loop
>>> %timeit apply_to_bins_groupby(numpy.prod, x, b)
1 loops, best of 3: 205 ms per loop

Overall, groupby is probably fine in most cases, but is unlikely to scale well, as suggested by this thread. Using a pure(er) numpy approach is slower for small inputs, but only by a bit; the tradeoff is a good one.

With pandas groupby this would be:

import pandas as pd

def with_pandas_groupby(func, x, b):
    grouped = pd.Series(x).groupby(b)
    return grouped.agg(func)

Using the example from the OP:

>>> x = [1,2,3,4,5,6]
>>> b = ["a","b","a","a","c","c"]
>>> with_pandas_groupby(np.prod, x, b)
a    12
b     2
c    30

I was just interested in the speed, so I compared with_pandas_groupby with some of the functions given in senderle's answer:

  • apply_to_bins_groupby

       3 levels,     100 values:  175 us per loop
       3 levels,    1000 values: 1.16 ms per loop
       3 levels, 1000000 values: 1.21 s per loop
      10 levels,     100 values:  304 us per loop
      10 levels,    1000 values: 1.32 ms per loop
      10 levels, 1000000 values: 1.23 s per loop
      26 levels,     100 values:  554 us per loop
      26 levels,    1000 values: 1.59 ms per loop
      26 levels, 1000000 values: 1.27 s per loop

  • apply_to_bins3

       3 levels,     100 values:  136 us per loop
       3 levels,    1000 values:  259 us per loop
       3 levels, 1000000 values:  205 ms per loop
      10 levels,     100 values:  297 us per loop
      10 levels,    1000 values:  447 us per loop
      10 levels, 1000000 values:  262 ms per loop
      26 levels,     100 values:  617 us per loop
      26 levels,    1000 values:  795 us per loop
      26 levels, 1000000 values:  299 ms per loop

  • with_pandas_groupby

       3 levels,     100 values:  365 us per loop
       3 levels,    1000 values:  443 us per loop
       3 levels, 1000000 values: 89.4 ms per loop
      10 levels,     100 values:  369 us per loop
      10 levels,    1000 values:  453 us per loop
      10 levels, 1000000 values: 88.8 ms per loop
      26 levels,     100 values:  382 us per loop
      26 levels,    1000 values:  466 us per loop
      26 levels, 1000000 values: 89.9 ms per loop

So pandas is the fastest for large item counts. Furthermore, the number of levels (bins) has no big influence on computation time. (Note that the times are measured starting from numpy arrays, so the time to create the pandas.Series is included.)

I generated the data with:

def gen_data(nlevels, size):
    choices = 'abcdefghijklmnopqrstuvwxyz'
    levels = np.asarray([l for l in choices[:nlevels]])
    index = np.random.random_integers(0, levels.size - 1, size)
    b = levels[index]
    x = np.arange(1, size + 1)
    return x, b

And then ran the benchmark in ipython like this:

In [174]: for nlevels in (3, 10, 26):
   .....:     for size in (100, 1000, 10e5):
   .....:         x, b = gen_data(nlevels, size)
   .....:         print '%2d levels, ' % nlevels, '%7d values:' % size,
   .....:         %timeit function_to_time(np.prod, x, b)
   .....:     print

In the special case where the aggregation function func can be expressed as a sum, bincount seems faster than pandas. For example, when func is the product function, it can be expressed as a sum of logarithms and we can do:

import numpy as np

x = np.arange( 1, 1000001 )   # start at 1: the log trick below needs positive values
b = np.random.randint( 0, 100, 1000000 )

def apply_to_bincount( values, bins ) :
    logy = np.bincount( bins, weights=np.log( values ) )
    return np.arange(len(logy)), np.exp( logy )
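Note that np.bincount only accepts nonnegative integer bin labels (and the log trick needs strictly positive values), so string labels like the OP's would need to be encoded first; a sketch (apply_to_bincount_str is a hypothetical name) using np.unique with return_inverse:

```python
import numpy as np

def apply_to_bincount_str(values, bins):
    # Encode the string labels as integer codes 0..n_bins-1.
    labels, codes = np.unique(bins, return_inverse=True)
    # Per-bin sum of logs, then exponentiate to recover the product.
    logy = np.bincount(codes, weights=np.log(values))
    return labels, np.exp(logy)

labels, y = apply_to_bincount_str(
    np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
    np.array(["a", "b", "a", "a", "c", "c"]))
# labels -> ['a', 'b', 'c'];  y -> approximately [12., 2., 30.]
```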

%%timeit
apply_to_bincount( x, b )
10 loops, best of 3: 16.9 ms per loop

%%timeit
with_pandas_groupby( np.prod, x, b )
10 loops, best of 3: 36.2 ms per loop
