
Merging records in a NumPy structured array

I have a NumPy structured array that is sorted by the first column:

x = array([(2, 3), (2, 8), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])

I need to merge adjacent records (summing the values of the second column) wherever

x[n][0] == x[n + 1][0]

In this case, the desired output would be:

x = array([(2, 11), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])

What's the best way to achieve this?

You can use np.unique to get an ID for each element in the first column, and then use np.bincount to accumulate the second-column values by ID:

In [140]: A
Out[140]: 
array([[25,  1],
       [37,  3],
       [37,  2],
       [47,  1],
       [59,  2]])

In [141]: unqA,idx = np.unique(A[:,0],return_inverse=True)

In [142]: np.column_stack((unqA,np.bincount(idx,A[:,1])))
Out[142]: 
array([[ 25.,   1.],
       [ 37.,   5.],
       [ 47.,   1.],
       [ 59.,   2.]])
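The output above is floating point because np.bincount returns float64 whenever weights are supplied. If integer output is wanted, one option is to cast the sums back, as in this minimal sketch. It assumes the sums are small enough to be represented exactly in float64; the variable names are illustrative only.

```python
import numpy as np

A = np.array([[25, 1], [37, 3], [37, 2], [47, 1], [59, 2]])

# Map each key in the first column to a group ID (0, 1, 2, ...).
keys, group_ids = np.unique(A[:, 0], return_inverse=True)

# bincount with weights returns float64, so cast the sums back
# to the integer dtype of the input.
sums = np.bincount(group_ids, weights=A[:, 1]).astype(A.dtype)

merged = np.column_stack((keys, sums))
print(merged)
```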

You can avoid np.unique with a combination of np.diff and np.cumsum. This might help because np.unique sorts internally, which is unnecessary here since the input data is already sorted. The implementation would look something like this:

In [201]: A
Out[201]: 
array([[25,  1],
       [37,  3],
       [37,  2],
       [47,  1],
       [59,  2]])

In [202]: unq1 = np.append(True,np.diff(A[:,0])!=0)

In [203]: np.column_stack((A[:,0][unq1],np.bincount(unq1.cumsum()-1,A[:,1])))
Out[203]: 
array([[ 25.,   1.],
       [ 37.,   5.],
       [ 47.,   1.],
       [ 59.,   2.]])
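The same idea can be applied directly to the structured array from the question. The sketch below is one possible way to wrap it in a function; the name merge_sorted_records and its key/value parameters are made up for illustration. It compares neighbouring keys with != instead of calling np.diff, which gives the same boolean mask without doing arithmetic on the unsigned key field.

```python
import numpy as np

def merge_sorted_records(x, key="recod", value="count"):
    """Sum `value` over runs of equal `key`; x must already be sorted by `key`."""
    if x.size == 0:
        return x.copy()
    # True at the first record of each run of equal keys.
    starts = np.append(True, x[key][1:] != x[key][:-1])
    # Running count of run starts gives a 0-based group ID per record.
    group_ids = np.cumsum(starts) - 1
    out = np.empty(starts.sum(), dtype=x.dtype)
    out[key] = x[key][starts]
    # bincount returns float64; assigning into the field casts it back.
    out[value] = np.bincount(group_ids, weights=x[value])
    return out

x = np.array([(2, 3), (2, 8), (4, 1)],
             dtype=[('recod', '<u8'), ('count', '<u4')])
merged = merge_sorted_records(x)
print(merged)
```

Because the function only looks at neighbouring records, it gives the intended result only when equal keys are adjacent, as they are in a sorted array.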

pandas makes this type of "group-by" operation trivial:

In [285]: import pandas as pd

In [286]: x = [(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)]

In [287]: df = pd.DataFrame(x)

In [288]: df
Out[288]: 
    0  1
0  25  1
1  37  3
2  37  2
3  47  1
4  59  2

In [289]: df.groupby(0).sum()
Out[289]: 
    1
0    
25  1
37  5
47  1
59  2

You probably won't want the dependency on pandas if this is the only operation you need from it, but once you get started, you might find other useful bits in the library.

Dicakar's answer cast in structured array form:

In [500]: x=np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)], dtype=[('recod', '<u8'), ('count', '<u4')])

Find the unique values and sum the counts for each one:

In [501]: unqA, idx=np.unique(x['recod'], return_inverse=True)    
In [502]: cnt = np.bincount(idx, x['count'])

Make a new structured array and fill the fields:

In [503]: x1 = np.empty(unqA.shape, dtype=x.dtype)
In [504]: x1['recod'] = unqA
In [505]: x1['count'] = cnt

In [506]: x1
Out[506]: 
array([(25, 1), (37, 5), (47, 1), (59, 2)], 
      dtype=[('recod', '<u8'), ('count', '<u4')])

There is a recarray function that builds an array from a list of arrays:

In [507]: np.rec.fromarrays([unqA,cnt],dtype=x.dtype)
Out[507]: 
rec.array([(25, 1), (37, 5), (47, 1), (59, 2)], 
      dtype=[('recod', '<u8'), ('count', '<u4')])

Internally it does the same thing: it builds an empty array of the right size and dtype, and then loops over the dtype fields. A recarray is just a structured array in a specialized array subclass wrapper.

There are two ways of populating a structured array (especially one with a diverse dtype): with a list of tuples, as you did with x, or field by field.
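As a small illustration of those two routes, the sketch below builds the same two-record array both ways. The dtype is the one from the question; the values are arbitrary.

```python
import numpy as np

dt = np.dtype([('recod', '<u8'), ('count', '<u4')])

# 1) From a list of tuples, one tuple per record.
a = np.array([(25, 1), (37, 5)], dtype=dt)

# 2) Field by field, one whole column at a time.
b = np.empty(2, dtype=dt)
b['recod'] = [25, 37]
b['count'] = [1, 5]

# Both routes produce the same records.
print((a == b).all())
```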

You can use np.add.reduceat (here x is a plain 2-D array, not the structured array). You just need the indices where x[:, 0] changes, which are the nonzero indices of np.diff(x[:, 0]) shifted by one, plus the initial index 0:

>>> i = np.r_[0, 1 + np.nonzero(np.diff(x[:,0]))[0]]
>>> a, b = x[i, 0], np.add.reduceat(x[:, 1], i)
>>> np.vstack((a, b)).T
array([[25,  1],
       [37,  5],
       [47,  1],
       [59,  2]])
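The reduceat approach also carries over to the structured array by working on its fields. This is a minimal sketch, assuming x is non-empty and sorted by 'recod'; it finds the run starts by comparing neighbouring keys rather than with np.diff.

```python
import numpy as np

x = np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)],
             dtype=[('recod', '<u8'), ('count', '<u4')])

keys = x['recod']
# Index of the first record in each run of equal keys (x is sorted by key).
starts = np.r_[0, 1 + np.flatnonzero(keys[1:] != keys[:-1])]

merged = np.empty(starts.size, dtype=x.dtype)
merged['recod'] = keys[starts]
# reduceat sums x['count'][starts[k]:starts[k+1]] for each k,
# staying in integer arithmetic throughout.
merged['count'] = np.add.reduceat(x['count'], starts)
print(merged)
```

Unlike np.bincount with weights, np.add.reduceat never goes through float64, so the sums stay integers from start to finish.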
