How to filter a grouped DataFrame by maximum values in each group using Pandas?

I hope you are doing well in the current situation.

I have the following DataFrame as input:

df_0 = pd.DataFrame({"year" : [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962,],
                     "genre": ['Action', 'Crime', 'Action', 'Drama', 'Thriller', 'Thriller', 'Crime', 'Drama', 'Drama', 'Thriller'],
                     "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09] })

Figure 0:

        year    genre   popularity
0       1960    Action    1.99
1       1960    Crime     0.53
2       1960    Action    1.81
3       1960    Drama     0.23
4       1961    Thriller  3.86
5       1961    Thriller  3.94
6       1961    Crime     0.21
7       1962    Drama     4.30
8       1962    Drama     5.60
9       1962    Thriller  0.09

I've created a new DataFrame df_1 by grouping on year and genre and aggregating popularity, like this:

df_1 = df_0.groupby(['year','genre']).popularity.agg(['mean','max'])

Figure 1:

                    mean    max
year    genre       
1960    Action      1.90    1.99
        Crime       0.53    0.53
        Drama       0.23    0.23
1961    Crime       0.21    0.21
        Thriller    3.90    3.94
1962    Drama       4.95    5.60
        Thriller    0.09    0.09

As a result, we get a DataFrame similar to the following:

df_1 = pd.DataFrame({"year" : [1960, 1960, 1960, 1961, 1961, 1962, 1962,],
                     "genre": ['Action', 'Crime', 'Drama', 'Crime', 'Thriller', 'Drama', 'Thriller'],
                     "mean" : [1.90, 0.53, 0.23, 0.21, 3.90, 4.95, 0.09],
                     "max"  : [1.99, 0.53, 0.23, 0.21, 3.94, 5.60, 0.09] }).set_index("year")
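Note that the groupby result in Figure 1 is indexed by a (year, genre) MultiIndex, while the hand-built df_1 above is indexed by year only. A minimal sketch of how one might get from the first form to the second, assuming the df_0 from the question (the name df_1_flat is hypothetical and only used here):

```python
import pandas as pd

df_0 = pd.DataFrame({
    "year": [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
    "genre": ["Action", "Crime", "Action", "Drama", "Thriller",
              "Thriller", "Crime", "Drama", "Drama", "Thriller"],
    "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09],
})

# Aggregate per (year, genre); the result carries a two-level MultiIndex.
df_1 = df_0.groupby(["year", "genre"]).popularity.agg(["mean", "max"])

# Move only the "genre" level into a column so that "year" stays as the index.
df_1_flat = df_1.reset_index(level="genre")
```

Keeping the MultiIndex is also fine; it just changes how the later selection step has to address rows.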

I'm struggling with the next step. I would like to create the following DataFrame df_2 from df_1 (the .groupby() result) using only pandas functions (no NumPy, or as little as possible):

df_2 = pd.DataFrame({"year" : [1960, 1961, 1962],
                     "genre": ['Action', 'Thriller', 'Drama'],
                     "mean" : [1.90, 3.90, 4.95],
                     "max"  : [1.99, 3.94, 5.60] }).set_index("year")

Figure 2:

        genre     mean  max
year            
1960    Action    1.90  1.99
1961    Thriller  3.90  3.94
1962    Drama     4.95  5.60

This DataFrame df_2 collects the maximum values of each group: for each year, it keeps the genre row with the highest max.

Any tips?
Thank you for your support.

Stay safe.

The idxmax() function gets the job done:

df_1 = df_0.loc[df_0.groupby('year').popularity.idxmax()].set_index("year")

Thanks to Phil and Corrodo for their support.
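For anyone who wants to try this end to end, here is a self-contained sketch of the idxmax() approach on the question's df_0. The names best_rows and top_per_year are hypothetical. Note that the result carries the raw popularity column rather than the mean and max columns of Figure 2.

```python
import pandas as pd

df_0 = pd.DataFrame({
    "year": [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
    "genre": ["Action", "Crime", "Action", "Drama", "Thriller",
              "Thriller", "Crime", "Drama", "Drama", "Thriller"],
    "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09],
})

# Row label of the most popular entry within each year.
best_rows = df_0.groupby("year").popularity.idxmax()

# Select those rows from the original frame and index the result by year.
top_per_year = df_0.loc[best_rows].set_index("year")
```

idxmax() returns index labels, so it pairs with .loc rather than .iloc.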

You could try the following:

import pandas as pd

# query the rows you want from df_1, then reset the index to turn
# year and genre into columns
df_2 = df_1.query('year in [1960, 1961] and genre in ["Action", "Thriller"]').reset_index()

The result will look like this:

   year     genre  mean   max
0  1960    Action   1.9  1.99
1  1961  Thriller   3.9  3.94
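The query above hard-codes the years and genres, and its output leaves out 1962. A data-driven alternative, sketched here under the assumption that df_1 is the groupby result from the question, is to sort by max and keep the first row per year:

```python
import pandas as pd

df_0 = pd.DataFrame({
    "year": [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
    "genre": ["Action", "Crime", "Action", "Drama", "Thriller",
              "Thriller", "Crime", "Drama", "Drama", "Thriller"],
    "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09],
})
df_1 = df_0.groupby(["year", "genre"]).popularity.agg(["mean", "max"])

df_2 = (
    df_1.reset_index()                        # year and genre become columns
        .sort_values("max", ascending=False)  # largest "max" first
        .drop_duplicates("year")              # keep the first (largest) row per year
        .sort_values("year")
        .set_index("year")
)
```

If two genres tie for the largest max within a year, this keeps only whichever one comes first after sorting.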

A bit long, but it works:

df_1.groupby(['year','genre']).max().reset_index().groupby(['genre']).max().reset_index().set_index('year')
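As far as I can tell, the second groupby in the chain above takes maxima per genre rather than per year, so it may not reproduce Figure 2 exactly. A shorter variant that groups on the year level of df_1 and uses idxmax() might look like the sketch below, assuming df_1 is the MultiIndexed groupby result from the question:

```python
import pandas as pd

df_0 = pd.DataFrame({
    "year": [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
    "genre": ["Action", "Crime", "Action", "Drama", "Thriller",
              "Thriller", "Crime", "Drama", "Drama", "Thriller"],
    "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09],
})
df_1 = df_0.groupby(["year", "genre"]).popularity.agg(["mean", "max"])

# For each year, find the (year, genre) label whose "max" is largest.
best_labels = df_1.groupby(level="year")["max"].idxmax().tolist()

# Select those rows, then move "genre" into a column so "year" is the index.
df_2 = df_1.loc[best_labels].reset_index(level="genre")
```

This stays within pandas, which matches the constraint stated in the question.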


 