
Get the max value from each group with pandas.DataFrame.groupby

I need to aggregate two columns of my DataFrame, count the values of the second column, and then keep only the row with the highest value in the "count" column. Let me show:

df =
col1|col2
---------
  A | AX
  A | AX
  A | AY
  A | AY
  A | AY
  B | BX
  B | BX
  B | BX
  B | BY
  B | BY
  C | CX
  C | CX
  C | CX
  C | CX
  C | CX
------------

df1 = df.groupby(['col1', 'col2']).agg({'col2': 'count'})
df1.columns = ['count']
df1 = df1.reset_index()

out:
col1 col2 count
A    AX   2
A    AY   3
B    BX   3
B    BY   2
C    CX   5

So far so good, but now I need only the row of each 'col1' group that has the maximum 'count' value, while keeping the corresponding value in 'col2'.

Expected output:

col1 col2 count
  A  AY   3
  B  BX   3
  C  CX   5

I have no idea how to do that. My attempts so far at using the max() aggregation always dropped 'col2'.

From your original DataFrame you can use .value_counts on the grouped column, which returns counts sorted in descending order within each group. Given that ordering, drop_duplicates keeps the most frequent value within each group.

df1 = (df.groupby('col1')['col2'].value_counts()
         .rename('counts').reset_index()
         .drop_duplicates('col1'))

  col1 col2  counts
0    A   AY       3
2    B   BX       3
4    C   CX       5
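
To make the two steps visible, here is a minimal self-contained sketch that rebuilds the question's data and keeps the intermediate table. The names counts and top are only illustrative; they are not from the answer above.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'col1': list('AAAAA') + list('BBBBB') + list('CCCCC'),
    'col2': ['AX', 'AX', 'AY', 'AY', 'AY',
             'BX', 'BX', 'BX', 'BY', 'BY',
             'CX', 'CX', 'CX', 'CX', 'CX'],
})

# Step 1: per-group counts, sorted descending inside each col1 group.
counts = (df.groupby('col1')['col2'].value_counts()
            .rename('counts').reset_index())

# Step 2: keep only the first (most frequent) row of each col1 group.
top = counts.drop_duplicates('col1')
print(top)
```

Because drop_duplicates keeps the first occurrence by default, the result depends on the descending order that value_counts produces within each group.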

Probably not ideal, but this works. It appears to assume df1 is still indexed by (col1, col2), i.e. before the reset_index() step:

df1.loc[df1.groupby(level=0).idxmax()['count']]
col1    col2    count
A       AY      3
B       BX      3
C       CX      5

This works because idxmax() on the grouped frame returns, for each group, the index label of the row with the largest 'count'; loc then selects those rows.
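
The same idea can be sketched on the flat table from the question (after reset_index()), grouping by the column name instead of the index level. This is a variant under that assumption, not the exact call shown above; best_rows and result are illustrative names.

```python
import pandas as pd

# Flat count table, as printed in the question after reset_index().
df1 = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'C'],
    'col2': ['AX', 'AY', 'BX', 'BY', 'CX'],
    'count': [2, 3, 3, 2, 5],
})

# idxmax() gives, for each col1 group, the index label of the row
# holding the largest count.
best_rows = df1.groupby('col1')['count'].idxmax()

# .loc then selects exactly those rows, so col2 comes along.
result = df1.loc[best_rows]
print(result)
```

Note that idxmax() returns a single label per group, so a tie within a group yields only one of the tied rows.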

I guess you need this: df['qty'] = 1 and then df.groupby(['col1', 'col2']).sum().reset_index()
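
A minimal sketch of this helper-column approach, on a smaller hypothetical frame. The sum only rebuilds the count table, so the sketch adds a sort plus drop_duplicates step (my assumption, not part of the suggestion above) to pick one row per col1.

```python
import pandas as pd

# Smaller hypothetical frame in the same shape as the question's data.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'col2': ['AX', 'AY', 'AY', 'BX', 'BX', 'BX', 'BY', 'BY'],
})

# Helper column of ones: summing it per (col1, col2) pair counts the rows.
df['qty'] = 1
totals = df.groupby(['col1', 'col2'], as_index=False)['qty'].sum()

# Picking the top row per col1 still needs a second step:
# sort by qty descending, then keep the first row seen for each col1.
top = (totals.sort_values('qty', ascending=False)
             .drop_duplicates('col1')
             .sort_values('col1'))
print(top)
```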

Option 1: Include Ties

Use this if ties are possible and you want to show all tied rows.

A tie occurs when, for instance, both (B, BX) and (B, BY) appear 3 times.

# Prepare packages
import pandas as pd

# Create dummy data
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'col2': ['AX', 'AX', 'AY', 'AY', 'AY', 'BX', 'BX', 'BX', 'BY', 'BY', 'BY', 'CX', 'CX', 'CX', 'CX', 'CX'],
})

# Get Max Value by Group with Ties
df_count = (df
            .groupby('col1')['col2']
            .value_counts()
            .to_frame('count')
            .reset_index())
# Boolean mask: True where a row's count equals the maximum count of its col1 group
m = df_count.groupby(['col1'])['count'].transform('max') == df_count['count']
df1 = df_count[m]
  col1 col2  count
0    A   AY      3
2    B   BX      3
3    B   BY      3
4    C   CX      5
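
To see why the mask keeps ties, it can help to look at what transform returns. The sketch below starts from a hand-written count table with a tie in group B; group_max, mask and with_ties are illustrative names.

```python
import pandas as pd

# Count table with a tie in group B.
df_count = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'C'],
    'col2': ['AY', 'AX', 'BX', 'BY', 'CX'],
    'count': [3, 2, 3, 3, 5],
})

# transform('max') returns one value per row: the maximum of that row's group.
group_max = df_count.groupby('col1')['count'].transform('max')

# Rows whose own count equals their group's maximum, ties included.
mask = df_count['count'] == group_max
with_ties = df_count[mask]
print(with_ties)
```

Since the comparison is done row by row, every row that reaches the group maximum passes the mask, which is what lets both BX and BY through.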

Option 2: Short Code Ignoring Ties

df1 = (df
 .groupby('col1')['col2']
 .value_counts()
 .groupby(level=0)
 .head(1)
 # .to_frame('count').reset_index() # Uncomment to get exact output requested
 )
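
For reference, a self-contained sketch of the same chain with the final conversion step enabled, run on the question's data without ties. The variable name top is illustrative.

```python
import pandas as pd

# The question's data (no ties).
df = pd.DataFrame({
    'col1': list('AAAAA') + list('BBBBB') + list('CCCCC'),
    'col2': ['AX', 'AX', 'AY', 'AY', 'AY',
             'BX', 'BX', 'BX', 'BY', 'BY',
             'CX', 'CX', 'CX', 'CX', 'CX'],
})

top = (df
       .groupby('col1')['col2']
       .value_counts()      # counts sorted descending within each col1 group
       .groupby(level=0)    # regroup by the first index level (col1)
       .head(1)             # first row per group = most frequent col2
       .to_frame('count')   # name the count column
       .reset_index())      # turn col1 and col2 back into columns
print(top)
```

With a tie inside a group, head(1) still returns a single row for that group, so only one of the tied values is kept.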
