How to filter a grouped DataFrame by maximum values in each group using Pandas?
I hope you are doing well in the current situation.
I have the following DataFrame as input:
df_0 = pd.DataFrame({"year" : [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962,],
"genre": ['Action', 'Crime', 'Action', 'Drama', 'Thriller', 'Thriller', 'Crime', 'Drama', 'Drama', 'Thriller'],
"popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09] })
Figure 0:
year genre popularity
0 1960 Action 1.99
1 1960 Crime 0.53
2 1960 Action 1.81
3 1960 Drama 0.23
4 1961 Thriller 3.86
5 1961 Thriller 3.94
6 1961 Crime 0.21
7 1962 Drama 4.30
8 1962 Drama 5.60
9 1962 Thriller 0.09
I've created a new DataFrame df_1 by grouping on year and genre like this:
df_1 = df_0.groupby(['year','genre']).popularity.agg(['mean','max'])
Figure 1:
mean max
year genre
1960 Action 1.90 1.99
Crime 0.53 0.53
Drama 0.23 0.23
1961 Crime 0.21 0.21
Thriller 3.90 3.94
1962 Drama 4.95 5.60
Thriller 0.09 0.09
As a result, we get a DataFrame similar to the following:
df_1 = pd.DataFrame({"year" : [1960, 1960, 1960, 1961, 1961, 1962, 1962,],
"genre": ['Action', 'Crime', 'Drama', 'Crime', 'Thriller', 'Drama', 'Thriller'],
"mean" : [1.90, 0.53, 0.23, 0.21, 3.90, 4.95, 0.09],
"max" : [1.99, 0.53, 0.23, 0.21, 3.94, 5.60, 0.09] }).set_index("year")
And I'm struggling with the next steps. I would like to create the following DataFrame df_2 from df_1 (with .groupby()), using only pandas functions (no NumPy, or as little as possible):
df_2 = pd.DataFrame({"year" : [1960, 1961, 1962],
"genre": ['Action', 'Thriller', 'Drama'],
"mean" : [1.90, 3.90, 4.95],
"max" : [1.99, 3.94, 5.60] }).set_index("year")
Figure 2:
genre mean max
year
1960 Action 1.90 1.99
1961 Thriller 3.90 3.94
1962 Drama 4.95 5.60
This DataFrame df_2 collects, for each year, the row with the maximum value.
Any tips?
Thank you for your support.
Stay safe.
The idxmax() function gets the job done:
df_1 = df_0.loc[df_0.groupby('year').popularity.idxmax()].set_index("year")
Thanks to Phil and Corrodo for their support.
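The line above works on df_0 and so returns the raw popularity column rather than the mean and max columns shown in Figure 2. As a minimal sketch, assuming df_1 is built with the groupby/agg call from the question (so it has a (year, genre) MultiIndex), the same idxmax() idea can be applied to df_1 directly. The name best_rows is only illustrative:

```python
import pandas as pd

df_0 = pd.DataFrame({
    "year": [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
    "genre": ["Action", "Crime", "Action", "Drama", "Thriller",
              "Thriller", "Crime", "Drama", "Drama", "Thriller"],
    "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09],
})

# Same aggregation as in the question: one row per (year, genre)
df_1 = df_0.groupby(["year", "genre"]).popularity.agg(["mean", "max"])

# For each year, idxmax() returns the (year, genre) label of the row
# whose 'max' value is the largest within that year
best_rows = df_1.groupby(level="year")["max"].idxmax()

# Select those rows, then move 'genre' out of the index into a column
df_2 = df_1.loc[best_rows.tolist()].reset_index(level="genre")
print(df_2)
```

If two genres tie for the largest value in a year, idxmax() keeps only the first one it encounters.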
You could try the following:
import pandas as pd
# query the rows you want from df_1, then reset the index to turn
# year and genre into columns
df_2 = df_1.query('year in [1960, 1961] and genre in ["Action", "Thriller"]').reset_index()
The result will look like this:
year genre mean max
0 1960 Action 1.9 1.99
1 1961 Thriller 3.9 3.94
A bit long, but it works:
df_1.groupby(['year','genre']).max().reset_index().groupby(['genre']).max().reset_index().set_index('year')
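Note that the final groupby in this chain is on genre, so it returns one row per genre rather than one per year. As a hedged alternative in the same chained style, the sketch below sorts by the max column and keeps the first row seen for each year. It assumes df_1 is built with the groupby/agg call from the question and is only one possible way to do this:

```python
import pandas as pd

df_0 = pd.DataFrame({
    "year": [1960, 1960, 1960, 1960, 1961, 1961, 1961, 1962, 1962, 1962],
    "genre": ["Action", "Crime", "Action", "Drama", "Thriller",
              "Thriller", "Crime", "Drama", "Drama", "Thriller"],
    "popularity": [1.99, 0.53, 1.81, 0.23, 3.86, 3.94, 0.21, 4.30, 5.60, 0.09],
})

# Same aggregation as in the question: one row per (year, genre)
df_1 = df_0.groupby(["year", "genre"]).popularity.agg(["mean", "max"])

df_2 = (
    df_1.reset_index()                          # year and genre become columns
        .sort_values("max", ascending=False)    # largest 'max' first
        .drop_duplicates(subset="year")         # keep the top row of each year
        .sort_values("year")                    # restore chronological order
        .set_index("year")
)
print(df_2)
```

This uses only pandas calls, which matches the constraint stated in the question.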