Pandas: group by 2 columns, apply a function, select the max and return the index value

Question

Here is the operation I am trying to do:

    ID  SUB_ID  AMOUNT
1  101       1      50
2  101       1     -10
3  101       1     -20
4  101       2      30
5  101       2      20
6  102       3      10
7  102       3     -10
8  102       4      10
9  102       4      10

We want to group by ID and SUB_ID, and then take the sum of the absolute values of AMOUNT. Then we order this summed column within each ID group and return the SUB_ID of the maximum value.

We can get the summation by:

df1 = (df
    .groupby(['ID','SUB_ID'])
    .apply(lambda x: np.sum(np.absolute(x['AMOUNT'])))
)

This will return a Series with a MultiIndex:

 ID    SUB_ID
 101     1        80
         2        50
 102     3        20
         4        20

From here I would like to return [1, 3]. ([1, 4] is also acceptable, since the two values in the 102 group are equal, but we want only one value per group.)

Obviously we can loop and pick the max, but I am trying to find the most efficient way possible. This operation will be applied to millions of rows.

Answer 1

This is one way. As your dataset is large, I strongly recommend you avoid lambda functions, since these are not applied in a vectorised fashion.

res = df.assign(AMOUNT=df['AMOUNT'].abs())\
        .groupby(['ID', 'SUB_ID'], as_index=False).sum()\
        .sort_values('AMOUNT', ascending=False)\
        .groupby('ID').head(1)

Example

df = pd.DataFrame([[101, 1, 50], [101, 1, -10], [101, 1, -20], [101, 2, 30],
                   [101, 2, 20], [102, 3, 10], [102, 3, -10], [102, 4, 10], [102, 4, 10]],
                  columns=['ID', 'SUB_ID', 'AMOUNT'])

res = df.assign(AMOUNT=df['AMOUNT'].abs())\
        .groupby(['ID', 'SUB_ID'], as_index=False).sum()\
        .sort_values('AMOUNT', ascending=False)\
        .groupby('ID').head(1)

print(res)

    ID  SUB_ID  AMOUNT
0  101       1      80
2  102       3      20
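If only the SUB_ID values are needed rather than the full rows, a variant that keeps the MultiIndex Series and uses idxmax might look like the sketch below. This is not part of the answer above: the names totals, best and sub_ids are illustrative, and it assumes ties should resolve to the first SUB_ID in each ID group.

```python
import pandas as pd

df = pd.DataFrame(
    [[101, 1, 50], [101, 1, -10], [101, 1, -20], [101, 2, 30], [101, 2, 20],
     [102, 3, 10], [102, 3, -10], [102, 4, 10], [102, 4, 10]],
    columns=['ID', 'SUB_ID', 'AMOUNT'])

# Sum of absolute values per (ID, SUB_ID), computed without a lambda.
totals = df['AMOUNT'].abs().groupby([df['ID'], df['SUB_ID']]).sum()

# For each ID, idxmax gives the (ID, SUB_ID) label of the first maximum.
best = totals.groupby(level='ID').idxmax()

# Keep only the SUB_ID part of each label.
sub_ids = [int(sub_id) for _, sub_id in best]
print(sub_ids)
```

Because idxmax reports the first occurrence of the maximum, the tie in the 102 group should resolve to SUB_ID 3 here.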

Answer 2

I think you can use nlargest:

df1.groupby('ID').nlargest(1).index.get_level_values(level='SUB_ID').tolist()

# [1, 3]
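For anyone who wants to run this end to end, here is a self-contained sketch. It rebuilds df1 from the question's sample data with a vectorised abs and sum instead of the lambda (my substitution, not something this answer specifies) and groups by the index level explicitly with level='ID'.

```python
import pandas as pd

df = pd.DataFrame(
    [[101, 1, 50], [101, 1, -10], [101, 1, -20], [101, 2, 30], [101, 2, 20],
     [102, 3, 10], [102, 3, -10], [102, 4, 10], [102, 4, 10]],
    columns=['ID', 'SUB_ID', 'AMOUNT'])

# Series indexed by (ID, SUB_ID), matching the df1 shown in the question.
df1 = df['AMOUNT'].abs().groupby([df['ID'], df['SUB_ID']]).sum()

# Largest total per ID; keep='first' (the default) settles ties by position.
top = df1.groupby(level='ID').nlargest(1, keep='first')

result = top.index.get_level_values('SUB_ID').tolist()
print(result)
```

With keep='first', the tied 102 group yields SUB_ID 3; keep='last' would be one way to get 4 instead.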
