Pandas groupby and compare rows to find the maximum value

Question

I have a dataframe

a	b	c
one	6	11
one	7	12
two	8	23
two	9	14
three	10	15
three	20	25

I want to group by column a and find the highest value in column c within each group, so that the row holding the maximum is flagged, i.e.

a	b	c
one	6	11
one	7	12

Compare the values 11 and 12, then

a	b	c
two	8	23
two	9	14

Compare the values 23 and 14, then

a	b	c
three	10	15
three	20	25

Eventually resulting in:

a	b	c	flag
one	6	11	no
one	7	12	yes
two	8	23	yes
two	9	14	no
three	10	15	no
three	20	25	yes

DataFrame for input/output:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a':["one","one","two","two","three","three"]
    , 'b':[6,7,8,9,10,20]
    , 'c':[11,12,23,14,15,25]
    # , 'flag': ['no', 'yes', 'yes', 'no', 'no', 'yes']
})
df

Answer 1

You can use groupby.transform to get the maximum value per group, and numpy.where to map True / False to 'yes' / 'no':

df['flag'] = np.where(df.groupby('a')['c'].transform('max').eq(df['c']), 'yes', 'no')

Output:

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

Intermediates:

df.groupby('a')['c'].transform('max')

0    12
1    12
2    23
3    23
4    25
5    25
Name: c, dtype: int64

df.groupby('a')['c'].transform('max').eq(df['c'])
0    False
1     True
2     True
3    False
4    False
5     True
Name: c, dtype: bool
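
As a small illustration of why transform is used here rather than a plain aggregation, the sketch below contrasts the two on the question's data. The names group_max and row_max are illustrative only and do not appear in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ["one", "one", "two", "two", "three", "three"],
    'b': [6, 7, 8, 9, 10, 20],
    'c': [11, 12, 23, 14, 15, 25],
})

# Plain aggregation: one value per group, indexed by the group key.
group_max = df.groupby('a')['c'].max()

# transform: the group maximum broadcast back onto every original row,
# so the result shares df's index and can be compared with df['c'] row by row.
row_max = df.groupby('a')['c'].transform('max')

df['flag'] = np.where(df['c'].eq(row_max), 'yes', 'no')
print(len(group_max), len(row_max))
print(df)
```

Because row_max is aligned with the original index, no merge or join back onto df is needed before the comparison.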

Answer 2

Use GroupBy.transform with max, compare the result with the same column c, and then set yes/no in numpy.where:

df['flag'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')

print(df)
       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

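If a group in a contains its maximum more than once, every matching row gets yes. If you need only the first maximum, use DataFrameGroupBy.idxmax and compare against df.index: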

df = pd.DataFrame({
    'a':["one","one","one","two","three","three"]
    , 'b':[6,7,8,9,10,20]
    , 'c':[11,12,12,14,15,25]
})

df['flag1'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
df['flag2'] = np.where(df.index == df.groupby('a')['c'].transform('idxmax'), 'yes', 'no')

print(df)

       a   b   c flag1 flag2
0    one   6  11    no    no
1    one   7  12   yes   yes
2    one   8  12   yes    no <- difference for match all max or first max
3    two   9  14   yes   yes
4  three  10  15    no    no
5  three  20  25   yes   yes
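
A possible variant of the first-maximum flag, sketched here under the assumption that the index labels of df are unique: aggregate with idxmax once per group and test index membership, instead of broadcasting with transform. The names first_max_idx and flag_first are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ["one", "one", "one", "two", "three", "three"],
    'b': [6, 7, 8, 9, 10, 20],
    'c': [11, 12, 12, 14, 15, 25],
})

# idxmax returns, for each group, the index label of the first row
# that holds the group's maximum.
first_max_idx = df.groupby('a')['c'].idxmax()

# Flag only the rows whose index label is one of those labels.
df['flag_first'] = np.where(df.index.isin(first_max_idx), 'yes', 'no')
print(df)
```

With a non-unique index the membership test could flag more than one row per group, so resetting the index first would be the safer choice in that situation.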

Answer 3

One approach is as follows. Each row's value in c is compared with the maximum of that row's own group:

df['flag'] = df.apply(lambda x: 'yes' if x['c'] == df.groupby('a')['c'].max().loc[x['a']] else 'no', axis=1)

       a   b   c flag
0    one   6  11   no
1    one   7  12  yes
2    two   8  23  yes
3    two   9  14   no
4  three  10  15   no
5  three  20  25  yes

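Breaking down the steps performed above:

df['flag'] creates a new column named flag.
df.groupby('a')['c'].max() groups by column a, using pandas.DataFrame.groupby, and finds the maximum value of column c in each group: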
```
 df2 = df.groupby('a')['c'].max()
```

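Then, for each row, we check whether its value in c equals the maximum stored for its own group in the Series produced in step 2. Note that df2 is indexed by the values of a, so it has to be looked up with x['a'], not x['c'].

 df['flag'] = df.apply(lambda x: 'yes' if x['c'] == df2.loc[x['a']] else 'no', axis=1)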

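Notes:

Checking that the group is the same is key. Otherwise, even though the check works for this specific case, it fails when a non-maximum value of one group is the maximum of another group (as mozway noted).
As the answer shared by jezrael indicates, .apply can be slow; even though it works, it may not be the most convenient approach.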
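
To make the first note concrete, here is a minimal sketch on hypothetical data (not from the question) in which 12 is the maximum of group "one" but only a non-maximum value in group "two". It compares a membership-only test with a per-group comparison; the latter uses Series.map instead of .apply, which is one way to avoid the row-wise loop. The names naive and correct are illustrative.

```python
import pandas as pd

# Hypothetical data: 12 is the max of group "one" but not of group "two".
df = pd.DataFrame({
    'a': ["one", "one", "two", "two"],
    'c': [11, 12, 12, 23],
})

# One maximum per group, indexed by the group key.
group_max = df.groupby('a')['c'].max()

# Membership-only test: asks "is this value the max of ANY group?",
# so it wrongly flags the ("two", 12) row.
naive = df['c'].isin(group_max.values)

# Per-group comparison: map each row's group key to that group's max,
# then compare element-wise.
correct = df['c'].eq(df['a'].map(group_max))

print(pd.DataFrame({'a': df['a'], 'c': df['c'],
                    'naive': naive, 'correct': correct}))
```

The two columns differ only on the third row, which is exactly the cross-group collision the note describes.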


3 solutions

Solution 1
3 Accepted 2022-09-20 08:08:00

Solution 2
2 2022-09-20 08:08:13

Solution 3
2 2022-09-20 08:13:54
