
Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:

id  data
1     2
1     7
1     3
2     8
2     9
2    10
3     1
3   -10

I would like to aggregate data by grouping on id and taking either the max or the min.

In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.

Is there a way I can avoid Python loops and do this in a vectorized manner?

I've been seeing some very similar questions on Stack Overflow the last few days. The following code is very similar to the implementation of numpy.unique and, because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.

import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order] # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

#max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order] #this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
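
A quick sanity check with the arrays from the question (a usage sketch, not part of the original answer):

groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_min(groups, data))  # -> [  2   8 -10]
print(group_max(groups, data))  # -> [ 7 10  1]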

In pure Python:

from itertools import groupby
from operator import itemgetter as ig

print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]

A variation:

print([data[id == i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]

Based on @Bago's answer:

import numpy as np

# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]

# get max(): the last element of each sorted group
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10  1]
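
The min is the first element of each sorted group; a small sketch along the same lines (not part of the original answer):

# get min(): mark the first element of each group instead of the last
print(data[np.r_[True, np.diff(id).astype(bool)]])
# -> [  2   8 -10]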

If pandas is installed:

from pandas import DataFrame

df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1     7
# 2    10
# 3     1
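
The min, or both at once, follows the same groupby pattern:

print(df.groupby('id')['data'].min())
print(df.groupby('id')['data'].agg(['min', 'max']))
#     min  max
# id
# 1     2    7
# 2     8   10
# 3   -10    1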

I'm fairly new to Python and NumPy, but it seems like you can use the .at method of ufuncs rather than reduceat:

import numpy as np

data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))

# initialize with -inf (the identity for max) so leftover values from
# np.empty can't leak into the result
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data_val)

For example:

data_val = array([ 0.65753453,  0.84279716,  0.88189818,  0.18987882,  0.49800668,
    0.29656994,  0.39542769,  0.43155428,  0.77982853,  0.44955868,
    0.22080219,  0.4807312 ,  0.9288989 ,  0.10956681,  0.73215416,
    0.33184318,  0.10936647])
ans = array([ 0.98969952,  0.84044947,  0.63460516,  0.92042078,  0.75738113,
    0.37976055])

Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge; presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).

I should point out that data_id doesn't actually need to be sorted.
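
The min works the same way with np.minimum.at; a sketch, initializing with the identity element for min:

ans_min = np.full(data_id.max() + 1, np.inf)
np.minimum.at(ans_min, data_id, data_val)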

I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface, and it has a lot more functionality as well:

import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
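
The same interface exposes the other common reductions, for instance (assuming numpy_indexed is installed):

group_id, group_min_data = npi.group_by(id).min(data)
group_id, group_mean_data = npi.group_by(id).mean(data)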

And so on.

With only numpy and without loops:

import numpy as np
import pandas as pd  # only needed for the comparison with pandas below

id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])

# max
_ndx = np.argsort(id)
_id, _pos  = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)

# min
_ndx = np.argsort(id)
_id, _pos  = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)

# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])

(pd_group.values == np_group.values).all()  # True

The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the latest value of x for each value of o, so that r[[0, 0]] = [1, 2] leaves r[0] == 2. It requires that your groups are integers from 0 to the number of groups - 1, as for numpy.bincount, and that there is a value for every group:

import numpy as np

def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    # sort descending so the smallest value per group is written last
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    # sort ascending so the largest value per group is written last
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
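
If the group labels are arbitrary (like the ids 1..3 in the question) rather than 0..n_groups-1, np.unique can remap them first; a small sketch:

ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
unique_ids, groups = np.unique(ids, return_inverse=True)  # groups is now 0..2
print(unique_ids, group_max(groups, data))
# -> [1 2 3] [ 7. 10.  1.]   (floats, since np.empty defaults to float64)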

A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster, though, considering the max/min isn't explicitly computed. The accepted solution's ability to ignore nans is neat, but one may also simply assign nan values a dummy key.

import numpy as np

def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.substract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
    """
    #upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    #first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    #the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1))
    #first entry of each bin is a unique key
    unique_keys = key[slices]
    #reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    #number of counts per key is the difference of our slice points. cap off with number of keys for last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count


names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

unique_keys, reduced_values, key_count = group(names, values)
print('per group mean')
print(reduced_values / key_count)
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print('per group min')
print(reduced_values)
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print('per group max')
print(reduced_values)

I think this accomplishes what you're looking for:

[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]

For the outer list comprehension, from right to left: set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. Moving to the inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.

This iterates over the full data list for each id. With some preprocessing into sublists it might be possible to speed it up, but it won't be a one-liner then; one way is sketched below.
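
One way to do that preprocessing is to bucket the values by id in a single pass (a sketch using a plain dict):

from collections import defaultdict

buckets = defaultdict(list)
for k, v in zip(id, data):
    buckets[k].append(v)
print([max(buckets[k]) for k in sorted(buckets)])
# -> [7, 10, 1]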
