
Pandas group by 2 columns, apply function, select max value and return index values

Here is the operation I am trying to do:

    ID  SUB_ID  AMOUNT
1  101       1      50
2  101       1     -10
3  101       1     -20
4  101       2      30
5  101       2      20
6  102       3      10
7  102       3     -10
8  102       4      10
9  102       4      10

We want to group by ID and SUB_ID, and then take the sum of the absolute values of AMOUNT. Then, within each ID group, we want to rank these sums and return the SUB_ID with the largest sum.

We can get the summation by:

import numpy as np

# Sum of |AMOUNT| for each (ID, SUB_ID) pair
df1 = (df
    .groupby(['ID','SUB_ID'])
    .apply(lambda x: np.sum(np.absolute(x['AMOUNT'])))
)

This returns a Series with a MultiIndex:

ID   SUB_ID
101  1         80
     2         50
102  3         20
     4         20

From here I would like to return [1, 3]. ([1, 4] is also acceptable, since the two sums in the 102 group are equal, but we want only one value per group.)

Obviously we can loop and pick the max, but I am looking for the most efficient approach, since this operation will be applied to millions of rows.

This is one way. As your dataset is large, I strongly recommend you avoid lambda functions with groupby.apply, since they are called once per group in Python rather than applied in a vectorised fashion.

res = df.assign(AMOUNT=df['AMOUNT'].abs())\
        .groupby(['ID', 'SUB_ID'], as_index=False).sum()\
        .sort_values('AMOUNT', ascending=False)\
        .groupby('ID').head(1)

Example

import pandas as pd

df = pd.DataFrame([[101, 1, 50], [101, 1, -10], [101, 1, -20], [101, 2, 30],
                   [101, 2, 20], [102, 3, 10], [102, 3, -10], [102, 4, 10], [102, 4, 10]],
                  columns=['ID', 'SUB_ID', 'AMOUNT'])

res = df.assign(AMOUNT=df['AMOUNT'].abs())\
        .groupby(['ID', 'SUB_ID'], as_index=False).sum()\
        .sort_values('AMOUNT', ascending=False)\
        .groupby('ID').head(1)

print(res)

    ID  SUB_ID  AMOUNT
0  101       1      80
2  102       3      20
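
If only the top row per ID is needed, the full sort may not be necessary. The following is a minimal sketch, not part of the original answer, that selects the row with the largest summed value per ID via idxmax. The names totals, winners and sub_ids are illustrative assumptions. On ties, idxmax keeps the first occurrence, so exactly one SUB_ID comes back per ID.

```python
import pandas as pd

df = pd.DataFrame([[101, 1, 50], [101, 1, -10], [101, 1, -20], [101, 2, 30],
                   [101, 2, 20], [102, 3, 10], [102, 3, -10], [102, 4, 10], [102, 4, 10]],
                  columns=['ID', 'SUB_ID', 'AMOUNT'])

# Sum of |AMOUNT| per (ID, SUB_ID), kept as ordinary columns
totals = (df.assign(AMOUNT=df['AMOUNT'].abs())
            .groupby(['ID', 'SUB_ID'], as_index=False)['AMOUNT'].sum())

# Row label of the largest sum within each ID (first occurrence on ties)
winners = totals.loc[totals.groupby('ID')['AMOUNT'].idxmax()]

sub_ids = winners['SUB_ID'].tolist()
print(sub_ids)
```

Whether this is faster than sorting on a given dataset is something to measure rather than assume.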

I think you can use nlargest:

df1.groupby('ID').nlargest(1).index.get_level_values(level='SUB_ID').tolist()

# [1, 3]
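
For a self-contained run of the nlargest approach, here is a hedged sketch that builds df1 without a lambda by taking the absolute value before grouping. This construction of df1 is an assumption for illustration, not the code from the question, and it groups with level='ID' to make explicit that ID is an index level.

```python
import pandas as pd

df = pd.DataFrame([[101, 1, 50], [101, 1, -10], [101, 1, -20], [101, 2, 30],
                   [101, 2, 20], [102, 3, 10], [102, 3, -10], [102, 4, 10], [102, 4, 10]],
                  columns=['ID', 'SUB_ID', 'AMOUNT'])

# Series of summed |AMOUNT| with a MultiIndex (ID, SUB_ID)
df1 = df['AMOUNT'].abs().groupby([df['ID'], df['SUB_ID']]).sum()

# Largest sum within each ID, then read the SUB_ID level of the result's index
top = df1.groupby(level='ID').nlargest(1)
result = top.index.get_level_values('SUB_ID').tolist()
print(result)
```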
