Numpy array: group by one column, sum another

I have an array that looks like this:

 array([[ 0,  1,  2],
        [ 1,  1,  6],
        [ 2,  2, 10],
        [ 3,  2, 14]])

I want to sum the values of the third column that have the same value in the second column, so the result is:

 array([[ 0,  1,  8],
        [ 1,  2, 24]])

I started coding this but I'm stuck on the sum:

import numpy as np
import sys

inFile = sys.argv[1]

with open(inFile, 'r') as t:
    f = np.genfromtxt(t, delimiter=None, names =["1","2","3"])

f.sort(order=["1","2"])
if value == previous.value:
   sum(f["3"])

You can use pandas to vectorize your algorithm:

import pandas as pd, numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(A)\
       .groupby(1, as_index=False)\
       .sum()\
       .reset_index()

res = df[['index', 1, 2]].values

Result

array([[ 0,  1,  8],
       [ 1,  2, 24]], dtype=int64)

If your data is sorted by the second column, you can use something centered around np.add.reduceat for a pure numpy solution. A combination of np.nonzero (or np.where) applied to np.diff will give you the locations where the second column switches values. You can use those indices to do the sum-reduction. The other columns are pretty formulaic, so you can concatenate them back in fairly easily:

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])
# Find the split indices
i = np.nonzero(np.diff(A[:, 1]))[0] + 1
i = np.insert(i, 0, 0)
# Compute the result columns
c0 = np.arange(i.size)
c1 = A[i, 1]
c2 = np.add.reduceat(A[:, 2], i)
# Concatenate the columns
result = np.c_[c0, c1, c2]


Notice the +1 in the indices. That is because you always want the location after the switch, not before, given how reduceat works. The insertion of zero as the first index could also be accomplished with np.r_, np.concatenate, etc.
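If the rows are not already sorted by the second column, one way to reuse the reduceat approach is to sort them first. The following is a minimal sketch under that assumption; the helper name group_sum_sorted and the unsorted sample array are made up for illustration and are not part of the original answer.

```python
import numpy as np

def group_sum_sorted(a):
    """Group rows of `a` by column 1 and sum column 2 (hypothetical helper)."""
    # Stable sort so rows with equal keys keep their original relative order
    a = a[np.argsort(a[:, 1], kind="stable")]
    # Positions where the key changes, shifted by one to mark the start of each new group
    starts = np.nonzero(np.diff(a[:, 1]))[0] + 1
    starts = np.insert(starts, 0, 0)
    # Sum column 2 over each run of equal keys
    sums = np.add.reduceat(a[:, 2], starts)
    return np.c_[np.arange(starts.size), a[starts, 1], sums]

# Same values as the question's array, but with the rows shuffled
unsorted = np.array([[0, 2, 10],
                     [1, 1,  2],
                     [2, 2, 14],
                     [3, 1,  6]])
result = group_sum_sorted(unsorted)
print(result)
```

The sketch assumes a non-empty input; an empty array would need a separate check before calling reduceat.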

That being said, I still think you are looking for the pandas version in @jpp's answer.

A very neat, pure numpy solution is possible using np.histogram:

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

c1 = np.unique(A[:, 1])
c0 = np.arange(c1.shape[0])
c2 = np.histogram(A[:, 1], weights=A[:, 2], bins=c1.shape[0])[0]

result = np.c_[c0, c1, c2]

>>> result
array([[ 0,  1,  8],
       [ 1,  2, 24]])

When a weights array (of the same shape as the input array) is provided to np.histogram, each element a[i] of the input array a contributes weights[i] to the count for its bin.

So for example, we are counting the second column, and instead of counting 2 instances of 2, we get 10 instances of 2 + 14 instances of 2 = a count of 24 in 2's bin.
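np.histogram with an integer bins argument splits the range between the smallest and largest key into equal-width bins, so the snippet above relies on the keys being evenly spaced, as they are here. For keys with gaps, a sketch along the following lines, which swaps np.histogram for np.unique with return_inverse plus np.bincount, may be more robust. The sample array B is invented for illustration.

```python
import numpy as np

# Keys in column 1 are deliberately unevenly spaced (1, 2, 10)
B = np.array([[0,  1,  2],
              [1,  1,  6],
              [2,  2, 10],
              [3, 10, 14],
              [4, 10,  1]])

# inverse[i] is the position of B[i, 1] within the sorted unique keys
keys, inverse = np.unique(B[:, 1], return_inverse=True)
# bincount adds weights[i] to bin number inverse[i]; with weights it returns floats,
# so cast back to the integer dtype of the input
sums = np.bincount(inverse, weights=B[:, 2]).astype(B.dtype)

result = np.c_[np.arange(keys.size), keys, sums]
print(result)
```

Because each distinct key gets its own bin number, two different keys can never share a bin, whatever their spacing.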

Here is my solution using only numpy arrays...

import numpy as np
arr = np.array([[ 0,  1,  2], [ 1,  1,  6], [ 2,  2, 10], [ 3,  2, 14]])

lst = []
compt = 0  # running group number, used as the first output column
for index in range(1, max(arr[:, 1]) + 1):
    lst.append([compt, index, np.sum(arr[arr[:, 1] == index][:, 2])])
    compt += 1
lst = np.array(lst)
print(lst)
# lst, outputs...
# [[ 0  1  8]
#  [ 1  2 24]]

The tricky part is np.sum(arr[arr[:, 1] == index][:, 2]), so let's break it down into multiple parts.

  • arr[arr[:, 1] == index] means...

You have an array arr, from which we ask numpy for the rows that match the current value of the for loop. Here, that value runs from 1 to the maximum value in the 2nd column (meaning the column with index 1). Printing only this expression in the for loop results in...

# First iteration
[[0 1 2]
 [1 1 6]]
# Second iteration
[[ 2  2 10]
 [ 3  2 14]]
  • Adding [:, 2] to our expression means that we want the values of the 3rd column (meaning index 2) of the rows above. If I print arr[arr[:, 1] == index][:, 2], it gives me [2, 6] at the first iteration and [10, 14] at the second.

  • I just need to sum these values using np.sum(), and to format my output list accordingly. :)
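The three steps can be traced on the sample array in a short sketch. It also shows that the mask and the column index can be combined into a single indexing operation, arr[mask, 2], which is an equivalent spelling rather than something from the original answer; the variable names are chosen for illustration.

```python
import numpy as np

arr = np.array([[0, 1, 2], [1, 1, 6], [2, 2, 10], [3, 2, 14]])

mask = arr[:, 1] == 1     # boolean array: which rows have 1 in the 2nd column
rows = arr[mask]          # the matching rows
values = rows[:, 2]       # their 3rd-column values
total = np.sum(values)    # the group sum

# Same values, selecting rows and column in one indexing step
same_values = arr[mask, 2]

print(mask.tolist(), values.tolist(), int(total))
```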

Using a dictionary to store the values and then converting back to a list:

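x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10],
     [ 3,  2, 14]]

y = {}
for val in x:
    if val[1] in y:
        # Key already seen: add this row's 3rd value to the running total
        y[val[1]][2] += val[2]
    else:
        # First row for this key: store a copy so the rows of x are not modified
        y[val[1]] = list(val)
print(list(y.values()))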

To get the exact output, use pandas:

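import pandas as pd
import numpy as np

a = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(a)
# Sum only column 2 within each group of column 1; the second reset_index
# adds the 0-based group number. to_numpy() replaces the removed as_matrix().
df.groupby(1)[2].sum().reset_index().reset_index().to_numpy()
#[[ 0 1  8]
# [ 1 2 24]]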

You can also use a defaultdict and sum the values:

from collections import defaultdict

x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10]]

# Missing keys start at 0, so each value can be added without checking first
res = defaultdict(int)
for val in x:
    res[val[1]] += val[2]
print([[i, val, res[val]] for i, val in enumerate(res)])
