
Is there any numpy group by function?

Is there any function in numpy to group this array down below by the first column?

I couldn't find any good answer over the internet.

>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])

Wanted output:

array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)

Inspired by Eelco Hoogendoorn's library, but without his library, and using the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):

>>> np.split(a[:,1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

I didn't "timeit" this, but it is probably the fastest way to solve the problem:

  • No native Python loop
  • The result lists are numpy arrays, so if you need to perform other numpy operations on them, no new conversion is needed
  • Complexity of about O(n)

[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh's comment.

The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.

import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])

Note that it is usually more efficient to directly compute relevant properties over such groups (i.e., group_by(keys).mean(values)), rather than first splitting into a list / jagged array.
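
For example, here is a minimal sketch with the array a from the question, assuming (per the package's documentation) that the reduction returns the unique keys together with the per-group values:

import numpy as np
import numpy_indexed as npi

a = np.array([[1, 275], [1, 441], [1, 494], [1, 593],
              [2, 679], [2, 533], [2, 686],
              [3, 559], [3, 219], [3, 455],
              [4, 605], [4, 468], [4, 692], [4, 613]])

# reduce each group directly instead of materializing the jagged split
keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])
# keys  -> array([1, 2, 3, 4])
# means -> approximately array([450.75, 632.67, 411.  , 594.5 ])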

Numpy is not very handy here because the desired output is not an array of integers (it is an array of list objects).

I suggest either the pure Python way...

from collections import defaultdict

%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# result:
defaultdict(list,
        {1: [275, 441, 494, 593],
         2: [679, 533, 686],
         3: [559, 219, 455],
         4: [605, 468, 692, 613]})

...or the pandas way:

import pandas as pd

%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object

n = np.unique(a[:, 0])
np.array([list(a[a[:, 0] == i, 1]) for i in n])

outputs:

array([[275, 441, 494, 593], [679, 533, 686], [559, 219, 455],
       [605, 468, 692, 613]], dtype=object)

Simplifying the answer of Vincent J, and considering the comment of HS-nebula, one can use return_index=True instead of return_counts=True and get rid of the cumsum:

np.split(a[:,1], np.unique(a[:,0], return_index = True)[1])[1:]

Output:

[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

I used np.unique() followed by np.extract():

unique = np.unique(a[:, 0:1])
answer = []
for element in unique:
    present = a[:,0]==element
    answer.append(np.extract(present,a[:,-1]))
print(answer)

[array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

Given X as the array of items you want grouped and y (a 1-D array) as the corresponding groups, the following function does the grouping with numpy:

def groupby(X, y):
    y = np.asarray(y)
    X = np.asarray(X)
    y_uniques = np.unique(y)
    return [X[y==yi] for yi in y_uniques]

So, groupby(a[:,1], a[:,0]) returns [array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

We might also find it useful to generate a dict:

def groupby(X): 
    X = np.asarray(X) 
    x_uniques = np.unique(X) 
    return {xi:X[X==xi] for xi in x_uniques} 

Let's try it out:

X=[1,1,2,2,3,3,3,3,4,5,6,7,7,8,9,9,1,1,1]
groupby(X)                                                                                                      
Out[9]: 
{1: array([1, 1, 1, 1, 1]),
 2: array([2, 2]),
 3: array([3, 3, 3, 3]),
 4: array([4]),
 5: array([5]),
 6: array([6]),
 7: array([7, 7]),
 8: array([8]),
 9: array([9, 9])}

Note this by itself is not super compelling - but if we make X an object or namedtuple and then provide a groupby function, it becomes more interesting. Will put that in later.
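
As a purely hypothetical sketch of where that could go (this record type and helper are illustrative, not part of the original answer), grouping namedtuples by one of their fields with the same np.unique idea:

from collections import namedtuple
import numpy as np

Record = namedtuple("Record", ["key", "value"])  # hypothetical record type

def groupby_records(records, field):
    # group records by one of their attributes, reusing np.unique on the keys
    keys = np.array([getattr(r, field) for r in records])
    return {k: [r for r, hit in zip(records, keys == k) if hit]
            for k in np.unique(keys).tolist()}

recs = [Record(1, 275), Record(1, 441), Record(2, 679)]
groupby_records(recs, "key")
# {1: [Record(key=1, value=275), Record(key=1, value=441)],
#  2: [Record(key=2, value=679)]}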

Late to the party, but anyway: if you plan not only to group the arrays, but also want to do operations on them like sum, mean and so on, and you're doing this with speed in mind, you might also want to consider numpy_groupies. All those group operations are optimized and jitted with numba. They easily outperform the other solutions mentioned.

from numpy_groupies.aggregate_numpy import aggregate
aggregate(a[:,0], a[:,1], "array", fill_value=[])
>>> array([array([], dtype=int64), array([275, 441, 494, 593]),
           array([679, 533, 686]), array([559, 219, 455]),
           array([605, 468, 692, 613])], dtype=object)
aggregate(a[:,0], a[:,1], "sum")
>>> array([   0, 1803, 1898, 1233, 2378])
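
The answer above also mentions mean; as a minimal sketch with the same a, assuming "mean" as the aggregation name (the empty group 0 receives the default fill value):

aggregate(a[:,0], a[:,1], "mean")
>>> array([  0.        , 450.75      , 632.66666667, 411.        , 594.5       ])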

It becomes pretty apparent that a = a[a[:, 0].argsort()] is a bottleneck of all the competitive grouping algorithms; big thanks to Vincent J for clarifying this. Over 80% of the processing time is spent on this argsort call, and there's no easy way to replace or optimise it. The numba package makes it possible to speed up a lot of algorithms and, hopefully, argsort will attract such efforts in the future. The remaining part of grouping can be improved significantly, assuming the indices of the first column are small.

TL;DR

The remaining part of most grouping methods contains the np.unique method, which is quite slow and excessive in cases where the group values are small. It's more efficient to replace it with np.bincount, which can later be improved further in numba. Here are some results of how that remaining part could be improved:

def _custom_return(unique_id, a, split_idx, return_groups):
    '''Choose if you want to also return unique ids'''
    if return_groups:
        return unique_id, np.split(a[:,1], split_idx)
    else: 
        return np.split(a[:,1], split_idx)

def numpy_groupby_index(a, return_groups=False):
    '''Code refactor of method of Vincent J'''
    u, idx = np.unique(a[:,0], return_index=True) 
    return _custom_return(u, a, idx[1:], return_groups)

def numpy_groupby_counts(a, return_groups=False):
    '''Use cumsum of counts instead of index'''
    u, counts = np.unique(a[:,0], return_counts=True)
    idx = np.cumsum(counts)
    return _custom_return(u, a, idx[:-1], return_groups)

def numpy_groupby_diff(a, return_groups=False):
    '''No use of any np.unique options'''
    u = np.unique(a[:,0])
    idx = np.flatnonzero(np.diff(a[:,0])) + 1
    return _custom_return(u, a, idx, return_groups)

def numpy_groupby_bins(a, return_groups=False):  
    '''Replace np.unique by np.bincount'''
    bins = np.bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

def numba_groupby_bins(a, return_groups=False):  
    '''Replace np.bincount by numba_bincount'''
    bins = numba_bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

So numba_bincount works in the same way as np.bincount, and it's defined like so:

import numpy as np
from numba import njit

@njit
def _numba_bincount(a, counts, m):
    for i in range(m):
        counts[a[i]] += 1

def numba_bincount(arr): #just a refactor of Python count
    M = np.max(arr)
    counts = np.zeros(M + 1, dtype=int)
    _numba_bincount(arr, counts, len(arr))
    return counts
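
A quick sanity check of the equivalence (assuming non-negative integer keys, which np.bincount requires as well):

keys = np.array([1, 1, 2, 3, 3, 3])
assert np.array_equal(numba_bincount(keys), np.bincount(keys))
# both give array([0, 2, 1, 3])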

Usage:

a = np.array([[1,275],[1,441],[1,494],[1,593],[2,679],[2,533],[2,686],[3,559],[3,219],[3,455],[4,605],[4,468],[4,692],[4,613]])
a = a[a[:, 0].argsort()]
>>> numpy_groupby_index(a, return_groups=False)
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
>>> numpy_groupby_index(a, return_groups=True)
(array([1, 2, 3, 4]),
 [array([275, 441, 494, 593]),
  array([679, 533, 686]),
  array([559, 219, 455]),
  array([605, 468, 692, 613])])

Performance tests

It takes ~30 seconds to sort 100M items on my computer (with 10 distinct IDs). Let's test how much time the methods for the remaining part take to run:

import numpy as np
import benchit

%matplotlib inline
benchit.setparams(rep=3)

sizes = [3*10**(i//2) if i%2 else 10**(i//2) for i in range(17)]
N = sizes[-1]
x1 = np.random.randint(0,10, size=N)
x2 = np.random.normal(loc=500, scale=200, size=N).astype(int)
a = np.transpose([x1, x2])

arr = a[a[:, 0].argsort()]
fns = [numpy_groupby_index, numpy_groupby_counts, numpy_groupby_diff, numpy_groupby_bins, numba_groupby_bins]
in_ = {s/1000000: (arr[:s], ) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of events')
t.plot(logx=True, figsize=(12, 6), fontsize=14)

[Benchmark plot: runtime of each grouping method vs. millions of events]

No doubt numba-powered bincount is the new winner for datasets that contain small IDs. It speeds up the grouping of the already-sorted data roughly 5x, which is about 10% of the total runtime.

Another approach, suggested by Ashwini Chaudhary, may be what you are looking for. Putting it in a simple function:

def np_groupby(x, index):
    return np.split(x, np.where(np.diff(x[:,index]))[0]+1)

x = numpy array

index = column index

[0] + 1, according to Ashwini: "...anything non-zero means that the item next to it was different. We can use numpy.where to find the indices of the non-zero items and then add 1 to them, because the actual index of such an item is one more than the returned index; numpy.diff is used to find out where the items actually change."
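
For example, with the sorted array a from the question, a minimal usage sketch; note that this variant splits the full rows, keys included:

>>> np_groupby(a, 0)
[array([[  1, 275],
        [  1, 441],
        [  1, 494],
        [  1, 593]]),
 array([[  2, 679],
        [  2, 533],
        [  2, 686]]),
 array([[  3, 559],
        [  3, 219],
        [  3, 455]]),
 array([[  4, 605],
        [  4, 468],
        [  4, 692],
        [  4, 613]])]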
