Pandas group by columns and perform aggregate on specific columns
I have the dataframe below.
import pandas as pd

df1 = pd.DataFrame({
    'col1': ["A", "X", "E", "A", "X", "X", "X"],
    'col2': ["B", "Y", "E", "B", "Y", "Y", "Y"],
    'col3': ["C", "Z", "E", "C", "Z", "Z", "Z"],
    'col4': ["D", "A", "F", "D", "A", "A", "A"],
    'Sex': ["Male", "Male", "Male", "Female", "Female", "Null", "Male"],
    'Count': [100, 50, 100, 50, 50, 10, 100],
    'Sum_me': [100, 200, 1, 400, 300, 500, 500],
    'Avg_me': [100, 200, 1, 400, 300, 500, 500],
})
After keeping only the rows that are duplicated on the columns col1, col2, col3 and col4, the dataframe looks like this.
columns = ['col1', 'col2', 'col3','col4']
df1 = df1[df1[columns].duplicated(keep=False)].sort_values('col1').reset_index(drop=True)
  col1 col2 col3 col4     Sex  Count  Sum_me  Avg_me
0    A    B    C    D    Male    100     100     100
1    A    B    C    D  Female     50     400     400
2    X    Y    Z    A    Male     50     200     200
3    X    Y    Z    A  Female     50     300     300
4    X    Y    Z    A    Null     10     500     500
5    X    Y    Z    A    Male    100     500     500
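For readers unfamiliar with `duplicated(keep=False)`, here is a minimal sketch of the filtering step on a smaller frame. The `demo` frame and its `key` and `value` columns are made up for illustration and are not part of the question. With `keep=False`, every member of a duplicated group is flagged, not only the later occurrences, so rows with a unique key are the only ones dropped.

```python
import pandas as pd

# Hypothetical toy frame: "A" and "X" repeat, "E" occurs once.
demo = pd.DataFrame({
    "key": ["A", "X", "E", "A", "X"],
    "value": [1, 2, 3, 4, 5],
})

# keep=False marks all rows that belong to a duplicated group.
mask = demo[["key"]].duplicated(keep=False)

# Keep only the flagged rows, then sort and renumber as in the question.
kept = demo[mask].sort_values("key").reset_index(drop=True)

print(mask.tolist())
print(kept)
```

Here the row with key "E" is removed, which mirrors how the `E, E, E, F` row disappears from `df1` above.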
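I am trying to aggregate the Sum_me and Avg_me columns, and I also want to create new columns, total_male, total_female and null, by taking the Count values from the rows whose Sex matches. total_male_female is the sum of the male, female and null counts. I tried the code below, but it does not give the expected result.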
result_df = df1.groupby(columns).agg({'Sum_me':'sum','Avg_me':'mean'}).reset_index()
Below is my expected output. Is there a way to do this with pandas? Any help would be appreciated.
Output:
col1 col2 col3 col4  total_male  total_female  null  total_male_female  Sum_me  Avg_me
A    B    C    D            100            50     0                150     500     250
X    Y    Z    A            150            50    10                210    1500     375
Try:
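# One column per Sex value, holding the summed Count for each group.
x = df1.pivot_table(
    index=["col1", "col2", "col3", "col4"],
    columns="Sex",
    values="Count",
    aggfunc="sum",
    fill_value=0,
)

# Aggregate Sum_me and Avg_me over the same group keys.
g = df1.groupby(["col1", "col2", "col3", "col4"])

# Align the three results on the shared group index.
out = pd.concat(
    [x, g["Sum_me"].sum(), g["Avg_me"].mean()], axis=1
).reset_index()
print(out)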
Prints:
  col1 col2 col3 col4  Female  Male  Null  Sum_me  Avg_me
0    A    B    C    D      50   100     0     500   250.0
1    X    Y    Z    A      50   150    10    1500   375.0
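The output above has the right numbers but keeps the raw Sex labels as column names and has no combined total. One possible way to get closer to the column layout requested in the question is sketched below. It is not part of the original answer: the variable names `keys`, `counts`, `stats` and `result` are illustrative, and the frame is the already-filtered data from the question, typed in by hand. `reindex` is used so that a Sex value missing from the data would still produce a zero-filled column.

```python
import pandas as pd

# The filtered frame from the question, re-entered so the sketch is self-contained.
df1 = pd.DataFrame({
    "col1": ["A", "A", "X", "X", "X", "X"],
    "col2": ["B", "B", "Y", "Y", "Y", "Y"],
    "col3": ["C", "C", "Z", "Z", "Z", "Z"],
    "col4": ["D", "D", "A", "A", "A", "A"],
    "Sex": ["Male", "Female", "Male", "Female", "Null", "Male"],
    "Count": [100, 50, 50, 50, 10, 100],
    "Sum_me": [100, 400, 200, 300, 500, 500],
    "Avg_me": [100, 400, 200, 300, 500, 500],
})
keys = ["col1", "col2", "col3", "col4"]

# Summed Count per group and Sex, one column per Sex value.
counts = df1.pivot_table(
    index=keys, columns="Sex", values="Count", aggfunc="sum", fill_value=0
)

# Rename to the requested names and fix the column order.
counts = counts.rename(
    columns={"Male": "total_male", "Female": "total_female", "Null": "null"}
).reindex(columns=["total_male", "total_female", "null"], fill_value=0)

# Row-wise total across the three count columns.
counts["total_male_female"] = counts.sum(axis=1)

# Named aggregation for the remaining two columns.
stats = df1.groupby(keys).agg(
    Sum_me=("Sum_me", "sum"),
    Avg_me=("Avg_me", "mean"),
)

# Both frames share the same group index, so a plain join lines them up.
result = counts.join(stats).reset_index()
result.columns.name = None  # drop any leftover "Sex" label on the column axis
print(result)
```

Note that the mean of Avg_me for the X group is 375.0 (1500 / 4), matching the printed result above.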