Is there any numpy group by function?

Is there any function in numpy to group this array down below by the first column?

I couldn't find any good answer on the internet.

>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])

Wanted output:

array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)

Inspired by Eelco Hoogendoorn's library, but without using it, this relies on the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):

>>> np.split(a[:,1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

I didn't "timeit" but this is probably the faster way to achieve the question :

  • No native Python loop
  • The resulting lists are numpy arrays, so if you need to apply other numpy operations to them, no new conversion is needed
  • Roughly O(n) complexity (on top of the sort, if one is needed)
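
For reference, here is a minimal, self-contained sketch of the whole pipeline, including the sort and a rough timing check with the timeit module (the group helper name and the wrapping are my own, not part of the original answer):

import timeit
import numpy as np

a = np.array([[1, 275], [1, 441], [1, 494], [1, 593],
              [2, 679], [2, 533], [2, 686],
              [3, 559], [3, 219], [3, 455],
              [4, 605], [4, 468], [4, 692], [4, 613]])

def group(arr):
    arr = arr[arr[:, 0].argsort()]                   # sort by key column (no-op if already sorted)
    keys, first_idx = np.unique(arr[:, 0], return_index=True)
    return keys, np.split(arr[:, 1], first_idx[1:])  # split the values at the start of each new key

print(group(a))
print(timeit.timeit(lambda: group(a), number=10000), "seconds for 10k runs")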

[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh's comment.

The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.

import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])

Note that it is usually more efficient to compute relevant properties directly over such groups (i.e., group_by(keys).mean(values)) than to first split into a list / jagged array.
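
For example, a small sketch of that pattern (as I read the numpy_indexed docs, aggregations return a tuple of the unique keys and the per-group result; the numbers below follow from the sample data):

import numpy as np
import numpy_indexed as npi

a = np.array([[1, 275], [1, 441], [1, 494], [1, 593],
              [2, 679], [2, 533], [2, 686],
              [3, 559], [3, 219], [3, 455],
              [4, 605], [4, 468], [4, 692], [4, 613]])

# mean of the second column within each group of the first column
keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])
# keys  -> array([1, 2, 3, 4])
# means -> array([450.75, 632.66666667, 411., 594.5])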

Numpy is not very handy here because the desired output is not an array of integers (it is an array of list objects).

I suggest either the pure Python way...

from collections import defaultdict

%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# result:
defaultdict(list,
        {1: [275, 441, 494, 593],
         2: [679, 533, 686],
         3: [559, 219, 455],
         4: [605, 468, 692, 613]})

...or the pandas way:

import pandas as pd

%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object

Another option is np.unique plus a list comprehension (on recent numpy versions, dtype=object is needed to build a ragged array):

n = np.unique(a[:,0])
np.array([list(a[a[:,0]==i,1]) for i in n], dtype=object)

outputs:

array([[275, 441, 494, 593], [679, 533, 686], [559, 219, 455],
       [605, 468, 692, 613]], dtype=object)

Simplifying the answer of Vincent J and considering the comment of HS-nebula, one can use return_index=True instead of return_counts=True and get rid of the cumsum:

np.split(a[:,1], np.unique(a[:,0], return_index = True)[1])[1:]

Output

[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

I used np.unique() followed by np.extract():

unique = np.unique(a[:, 0:1])
answer = []
for element in unique:
    present = a[:, 0] == element
    answer.append(np.extract(present, a[:, -1]))
print(answer)

[array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

Given X as the array of items you want grouped and y (a 1D array) as the corresponding group labels, the following function does the grouping with numpy:

def groupby(X, y):
    y = np.asarray(y)
    X = np.asarray(X)
    y_uniques = np.unique(y)
    return [X[y==yi] for yi in y_uniques]

So, groupby(a[:,1], a[:,0]) returns [array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

We might also find it useful to generate a dict:

def groupby(X): 
    X = np.asarray(X) 
    x_uniques = np.unique(X) 
    return {xi:X[X==xi] for xi in x_uniques} 

Let's try it out:

X=[1,1,2,2,3,3,3,3,4,5,6,7,7,8,9,9,1,1,1]
groupby(X)                                                                                                      
Out[9]: 
{1: array([1, 1, 1, 1, 1]),
 2: array([2, 2]),
 3: array([3, 3, 3, 3]),
 4: array([4]),
 5: array([5]),
 6: array([6]),
 7: array([7, 7]),
 8: array([8]),
 9: array([9, 9])}

Note that this by itself is not super compelling, but if we make X an object or namedtuple and then provide a groupby function, it becomes more interesting. Will put that in later.
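
In the meantime, here is one possible sketch of that idea: grouping a list of namedtuples by a key function, reusing np.unique for the group labels (the Event type and the groupby_key helper are illustrative names of mine, not an established API):

from collections import namedtuple
import numpy as np

Event = namedtuple("Event", ["key", "val"])

def groupby_key(items, key):
    # build an array of key labels, then index back into the original list
    labels = np.array([key(item) for item in items])
    return {k: [items[i] for i in np.flatnonzero(labels == k)]
            for k in np.unique(labels)}

events = [Event(1, 275), Event(1, 441), Event(2, 679)]
groupby_key(events, key=lambda e: e.key)
# {1: [Event(key=1, val=275), Event(key=1, val=441)], 2: [Event(key=2, val=679)]}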

Late to the party, but anyway: if you plan not only to group the arrays but also to do operations on them like sum, mean, and so on, and you're doing this with speed in mind, you might also want to consider numpy_groupies. All those group operations are optimized and JIT-compiled with numba. They easily outperform the other solutions mentioned here.

from numpy_groupies.aggregate_numpy import aggregate
aggregate(a[:,0], a[:,1], "array", fill_value=[])
>>> array([array([], dtype=int64), array([275, 441, 494, 593]),
           array([679, 533, 686]), array([559, 219, 455]),
           array([605, 468, 692, 613])], dtype=object)
aggregate(a[:,0], a[:,1], "sum")
>>> array([   0, 1803, 1898, 1233, 2378])
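
Other reductions follow the same pattern; "mean" is among the aggregation names numpy_groupies supports. The output below is computed by hand from the sample array, assuming the empty group 0 gets the default fill value of 0:

aggregate(a[:,0], a[:,1], "mean")
>>> array([  0.        , 450.75      , 632.66666667, 411.        , 594.5       ])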

It becomes pretty apparent that a = a[a[:, 0].argsort()] is the bottleneck of all the competitive grouping algorithms (big thanks to Vincent J for clarifying this). Over 80% of processing time is spent on this argsort call, and there is no easy way to replace or optimize it. The numba package makes it possible to speed up a lot of algorithms, and hopefully argsort will attract such efforts in the future. The remaining part of the grouping can be improved significantly, assuming the indices of the first column are small.

TL;DR

In the majority of grouping methods, the remaining part relies on np.unique, which is quite slow and excessive when the group values are small. It is more efficient to replace it with np.bincount, which can in turn be improved with numba. Here are some results showing how the remaining part can be improved:

def _custom_return(unique_id, a, split_idx, return_groups):
    '''Choose if you want to also return unique ids'''
    if return_groups:
        return unique_id, np.split(a[:,1], split_idx)
    else: 
        return np.split(a[:,1], split_idx)

def numpy_groupby_index(a, return_groups=False):
    '''Code refactor of method of Vincent J'''
    u, idx = np.unique(a[:,0], return_index=True) 
    return _custom_return(u, a, idx[1:], return_groups)

def numpy_groupby_counts(a, return_groups=False):
    '''Use cumsum of counts instead of index'''
    u, counts = np.unique(a[:,0], return_counts=True)
    idx = np.cumsum(counts)
    return _custom_return(u, a, idx[:-1], return_groups)

def numpy_groupby_diff(a, return_groups=False):
    '''No use of any np.unique options'''
    u = np.unique(a[:,0])
    idx = np.flatnonzero(np.diff(a[:,0])) + 1
    return _custom_return(u, a, idx, return_groups)

def numpy_groupby_bins(a, return_groups=False):  
    '''Replace np.unique by np.bincount'''
    bins = np.bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

def numba_groupby_bins(a, return_groups=False):  
    '''Replace np.bincount by numba_bincount'''
    bins = numba_bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

So numba_bincount works in the same way as np.bincount and it's defined like so:

from numba import njit

@njit
def _numba_bincount(a, counts, m):
    for i in range(m):
        counts[a[i]] += 1

def numba_bincount(arr):  # wrapper: allocate the counts array, then fill it in the jitted loop
    M = np.max(arr)
    counts = np.zeros(M + 1, dtype=int)
    _numba_bincount(arr, counts, len(arr))
    return counts

Usage:

a = np.array([[1,275],[1,441],[1,494],[1,593],[2,679],[2,533],[2,686],[3,559],[3,219],[3,455],[4,605],[4,468],[4,692],[4,613]])
a = a[a[:, 0].argsort()]
>>> numpy_groupby_index(a, return_groups=False)
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
>>> numpy_groupby_index(a, return_groups=True)
(array([1, 2, 3, 4]),
 [array([275, 441, 494, 593]),
  array([679, 533, 686]),
  array([559, 219, 455]),
  array([605, 468, 692, 613])])

Performance tests

It takes ~30 seconds to sort 100M items on my computer (with 10 distinct IDs). Let's test how much time the remaining-part methods take to run:

%matplotlib inline
import benchit
benchit.setparams(rep=3)

sizes = [3*10**(i//2) if i%2 else 10**(i//2) for i in range(17)]
N = sizes[-1]
x1 = np.random.randint(0,10, size=N)
x2 = np.random.normal(loc=500, scale=200, size=N).astype(int)
a = np.transpose([x1, x2])

arr = a[a[:, 0].argsort()]
fns = [numpy_groupby_index, numpy_groupby_counts, numpy_groupby_diff, numpy_groupby_bins, numba_groupby_bins]
in_ = {s/1000000: (arr[:s], ) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of events')
t.plot(logx=True, figsize=(12, 6), fontsize=14)

(Benchmark plot: runtime of each grouping method vs. millions of events.)

No doubt the numba-powered bincount is the new winner for datasets that contain small IDs. It speeds up the grouping of already-sorted data roughly 5x, and that grouping step is about 10% of the total runtime.

Another approach, suggested by Ashwini Chaudhary, may be what you are looking for. Putting it in a simple function:

def np_groupby(x, index):
    return np.split(x, np.where(np.diff(x[:,index]))[0]+1)

where x is the numpy array and index is the column index.

Regarding the [0] + 1, according to Ashwini: "...anything non-zero means that the item next to it was different; we can use numpy.where to find the indices of non-zero items and then add 1 to it, because the actual index of such an item is one more than the returned index; ...numpy.diff is used to find out where the items actually change."
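
A quick trace of the intermediate steps on a small sorted sample makes that index arithmetic concrete (the sample array here is my own):

import numpy as np

x = np.array([[1, 275], [1, 441], [2, 679], [2, 533], [3, 559]])

np.diff(x[:, 0])                   # array([0, 1, 0, 1]) - nonzero where the key changes
np.where(np.diff(x[:, 0]))[0]      # array([1, 3])       - positions just before each change
np.where(np.diff(x[:, 0]))[0] + 1  # array([2, 4])       - the split points for np.split
np_groupby(x, 0)
# [array([[  1, 275],
#         [  1, 441]]),
#  array([[  2, 679],
#         [  2, 533]]),
#  array([[  3, 559]])]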
