Pandas groupby() and agg() method confusion on columns
What is the difference between
df[['column1', 'column2']].groupby('column1').agg(['mean', 'count'])
and
df[['column1', 'column2']].groupby('column1').agg({'column2': 'mean', 'column2': 'count'})
In the first example, mean and count are computed on column2, which is not the groupby column.
In the second example the logic is the same, but I explicitly named column2 in agg.
Why do I not see the same result for both?
The problem with the second statement has to do with the 'column2' key being overwritten.
There are at least three ways to write this aggregation so that both results are kept.
First, let's build a test dataset:
import pandas as pd
from seaborn import load_dataset
df_tips = load_dataset('tips')
df_tips.head()
df_tips[['sex','size']].groupby(['sex']).agg(['mean','count'])
Output:
            size
            mean count
sex
Male    2.630573   157
Female  2.459770    87
This gives a DataFrame with a MultiIndex column header: 'size' at level 0 and the two aggregations, mean and count, at level 1.
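A two-level column header is addressed with tuples. The sketch below shows one way to select a single aggregation and, if preferred, to flatten the header. It uses a small made-up frame in place of the tips data, and the joined names such as size_mean are only an illustrative choice.

```python
import pandas as pd

# Small made-up frame standing in for the tips data (assumption).
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'size': [2, 4, 3, 1],
})

result = df.groupby('sex').agg(['mean', 'count'])
print(result.columns.tolist())  # [('size', 'mean'), ('size', 'count')]

# Select one aggregation with a (level 0, level 1) tuple.
mean_size = result[('size', 'mean')]

# Optionally join the two levels into single flat names.
flat = result.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat.columns.tolist())    # ['size_mean', 'size_count']
```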
df_tips[['sex','size']].groupby(['sex']).agg({'size':['mean','count']})
Output (same as above):
            size
            mean count
sex
Male    2.630573   157
Female  2.459770    87
df_tips[['sex','size']].groupby(['sex']).agg(mean_size=('size','mean'),count_size=('size','count'))
Output:
        mean_size  count_size
sex
Male     2.630573         157
Female   2.459770          87
This gives a DataFrame with a flat column header that you name yourself. Because each name is written as a keyword argument, it must be a valid Python identifier, so it cannot contain spaces or special characters.
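The naming restriction comes from Python's keyword-argument syntax rather than from the aggregation itself. One workaround, sketched here on a small made-up frame, is to build a dict of output names and unpack it with **, which should allow names containing spaces. The names 'mean size' and 'count size' are illustrative only.

```python
import pandas as pd

# Small made-up frame standing in for the tips data (assumption).
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'size': [2, 4, 3, 1],
})

# Keys are output column names, values are (column, aggregation) pairs.
# Unpacking with ** passes them as keyword arguments without the
# identifier restriction that literal keyword syntax imposes.
named = df.groupby('sex').agg(**{
    'mean size': ('size', 'mean'),
    'count size': ('size', 'count'),
})
print(named.columns.tolist())  # ['mean size', 'count size']
```

The keys are distinct here, so nothing is overwritten, unlike a dict that repeats the same column name as a key.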
df_tips[['sex','size']].groupby(['sex']).agg({'size':'mean','size':'count'})
Output:
        size
sex
Male     157
Female    87
What is happening here is that the dict contains the key 'size' twice. A Python dict keeps only the last value for a repeated key, so the 'mean' entry is overwritten by 'count' before agg ever sees it, and only the count column comes back.
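The overwrite can be checked without any aggregation at all by printing the dict. The sketch below again uses a small made-up frame rather than the tips data, and compares the result with an explicit count-only call.

```python
import pandas as pd

# A dict literal with a repeated key keeps only the last value.
spec = {'size': 'mean', 'size': 'count'}
print(spec)  # {'size': 'count'}

# Small made-up frame standing in for the tips data (assumption).
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'size': [2, 4, 3, 1],
})

overwritten = df.groupby('sex').agg(spec)
count_only = df.groupby('sex').agg({'size': 'count'})

# Both calls receive the same one-entry dict, so the results match.
print(overwritten.equals(count_only))  # True
```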