Pandas DataFrame multiple groupby filtering

Question

I have the following dataframe:

import pandas as pd

df2 = pd.DataFrame({'season': [1, 1, 1, 2, 2, 2, 3, 3], 'value': [-2, 3, 1, 5, 8, 6, 7, 5], 'test': [3, 2, 6, 8, 7, 4, 25, 2], 'test2': [4, 5, 7, 8, 9, 10, 11, 12]}, index=['2020', '2020', '2020', '2020', '2020', '2021', '2021', '2021'])
df2.index = pd.to_datetime(df2.index)  # parse the year strings as timestamps
df2.index = df2.index.year             # keep only the integer year
print(df2)

        season  test  test2  value
2020       1     3      4     -2
2020       1     2      5      3
2020       1     6      7      1
2020       2     8      8      5
2020       2     7      9      8
2021       2     4     10      6
2021       3    25     11      7
2021       3     2     12      5

I would like to filter it to obtain, for each year and each season of that year, the maximum value of the column 'value'. How can I do that efficiently?

Expected result:

print(df_result)

        season  value  test  test2
year                            
2020       1      3     2      5
2020       2      8     7      9
2021       2      6     4     10
2021       3      7     25    11

Thank you for your help,

Pierre

Answer 1

This is a groupby operation, but a slightly non-trivial one, so I am posting it as an answer.

(df2.set_index('season', append=True)
    .groupby(level=[0, 1])
    .value.max()
    .reset_index(level=1)
)
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7

Answer 2

You can elevate your index to a series, then perform a groupby operation on a list of columns:

df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])['value'].max().reset_index()

print(df_result)

   year  season  value
0  2020       1      3
1  2020       2      8
2  2021       2      6
3  2021       3      7

If you wish, you can make year your index again via df_result = df_result.set_index('year').

To keep the other columns, use:

df2['year'] = df2.index
df2['value'] = df2.groupby(['year', 'season'])['value'].transform('max')

Then drop any duplicates via pd.DataFrame.drop_duplicates.
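The transform-then-deduplicate approach can be sketched end to end as follows. This is only an illustration under two assumptions that are not in the answer: the index is built directly from integer years rather than parsed from strings, and duplicates are dropped on the grouping keys (`subset=['year', 'season']`), because rows within a group still differ in `test` and `test2`.

```python
import pandas as pd

# Same data as in the question; integer years are used directly (assumption).
df2 = pd.DataFrame(
    {
        "season": [1, 1, 1, 2, 2, 2, 3, 3],
        "value": [-2, 3, 1, 5, 8, 6, 7, 5],
        "test": [3, 2, 6, 8, 7, 4, 25, 2],
        "test2": [4, 5, 7, 8, 9, 10, 11, 12],
    },
    index=[2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021],
)

out = df2.copy()
out["year"] = out.index

# Broadcast each group's maximum back onto every row of that group.
out["value"] = out.groupby(["year", "season"])["value"].transform("max")

# Deduplicate on the grouping keys only; the default keep='first'
# retains the first row of each (year, season) group.
out = out.drop_duplicates(subset=["year", "season"])
print(out)
```

Note that `test` and `test2` then come from the first row of each group, not necessarily from the row that held the maximum.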

Update #1

For your new requirement, you need to apply an aggregation function to 2 series:

df2['year'] = df2.index

df_result = df2.groupby(['year', 'season'])\
               .agg({'value': 'max', 'test': 'last'})\
               .reset_index()

print(df_result)

   year  season  value  test
0  2020       1      3     6
1  2020       2      8     7
2  2021       2      6     4
3  2021       3      7     2
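If distinct output column names are wanted, the same per-column aggregation can probably be written with pandas named aggregation (available since pandas 0.25). This is a sketch rather than part of the original answer; the output names `max_value` and `last_test` are made up for illustration, and the index is built from integer years directly.

```python
import pandas as pd

# Same data as in the question; integer years are used directly (assumption).
df2 = pd.DataFrame(
    {
        "season": [1, 1, 1, 2, 2, 2, 3, 3],
        "value": [-2, 3, 1, 5, 8, 6, 7, 5],
        "test": [3, 2, 6, 8, 7, 4, 25, 2],
        "test2": [4, 5, 7, 8, 9, 10, 11, 12],
    },
    index=[2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021],
)
df2["year"] = df2.index

# Each keyword is an output column: (source column, aggregation function).
# 'max_value' and 'last_test' are hypothetical names chosen for this sketch.
df_result = (
    df2.groupby(["year", "season"])
    .agg(max_value=("value", "max"), last_test=("test", "last"))
    .reset_index()
)
print(df_result)
```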

Update #2

For your finalised requirement:

df2['year'] = df2.index

df2['max_value'] = df2.groupby(['year', 'season'])['value'].transform('max')

df_result = df2.loc[df2['value'] == df2['max_value']]\
               .drop_duplicates(['year', 'season'])\
               .drop(columns='max_value')


print(df_result)

      season  value  test  test2  year
2020       1      3     2      5  2020
2020       2      8     7      9  2020
2021       2      6     4     10  2021
2021       3      7    25     11  2021
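An alternative that is not from the answer above is to look up the row holding each group's maximum with `idxmax`. The sketch below assumes integer years in the index and first moves that repeated index into a column, since `idxmax` returns index labels and needs them to be unique.

```python
import pandas as pd

# Same data as in the question; integer years are used directly (assumption).
df2 = pd.DataFrame(
    {
        "season": [1, 1, 1, 2, 2, 2, 3, 3],
        "value": [-2, 3, 1, 5, 8, 6, 7, 5],
        "test": [3, 2, 6, 8, 7, 4, 25, 2],
        "test2": [4, 5, 7, 8, 9, 10, 11, 12],
    },
    index=[2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021],
)

# Name the index 'year' and turn it into a column, leaving a unique RangeIndex.
flat = df2.rename_axis("year").reset_index()

# Label of the row with the largest 'value' in each (year, season) group;
# on ties, idxmax picks the first occurrence.
best_rows = flat.groupby(["year", "season"])["value"].idxmax()

# Select those full rows and restore 'year' as the index.
df_result = flat.loc[best_rows].set_index("year")
print(df_result)
```

This keeps `test` and `test2` from the same row as the maximum, which is what the expected result in the question asks for.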

Answer 3

You can use get_level_values to bring the index values into the groupby:

df2.groupby([df2.index.get_level_values(0),df2.season]).value.max().reset_index(level=1)
Out[38]: 
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7
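A related sketch, assuming you are free to give the index a name: pandas lets `groupby` keys mix index level names and column names, so naming the index `year` (a name chosen here to match the question's expected output) avoids extracting the level values by hand. The index is again built from integer years directly.

```python
import pandas as pd

# Same data as in the question; integer years are used directly (assumption).
df2 = pd.DataFrame(
    {
        "season": [1, 1, 1, 2, 2, 2, 3, 3],
        "value": [-2, 3, 1, 5, 8, 6, 7, 5],
        "test": [3, 2, 6, 8, 7, 4, 25, 2],
        "test2": [4, 5, 7, 8, 9, 10, 11, 12],
    },
    index=[2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021],
)

# 'year' refers to the named index level, 'season' to a regular column.
named = df2.rename_axis("year")
result = (
    named.groupby(["year", "season"])["value"]
    .max()
    .reset_index(level="season")  # keep 'year' as the index
)
print(result)
```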

3 answers

Answer 1: 3, 2018-06-07 16:40:12
Answer 2: 2, accepted, 2018-06-07 16:40:20
Answer 3: 1, 2018-06-07 17:07:12
