Pandas groupby() and agg() method confusion on columns

What is the difference between

df[['column1', 'column2']].groupby('column1').agg(['mean', 'count'])

and

df[['column1', 'column2']].groupby('column1').agg({'column2': 'mean', 'column2': 'count'})

In the first example, mean and count are computed on column2, which is not the groupby key.

The second example uses the same logic, but I explicitly name column2 in agg.

Why do I not see the same result for both?

TLDR

The problem with the second statement has to do with overwriting: the dictionary repeats the key 'column2', so only the last aggregation survives.


There are at least three ways to write this statement correctly.

First, let's build a test dataset:

import pandas as pd
from seaborn import load_dataset

df_tips = load_dataset('tips')

df_tips.head()

Statement 1: Same as your first way
df_tips[['sex','size']].groupby(['sex']).agg(['mean','count'])

Output:

            size      
            mean count
sex                   
Male    2.630573   157
Female  2.459770    87

The result is a DataFrame with MultiIndex columns: 'size' at level 0 and the two aggregations at level 1.

Statement 2: Using a list of aggregations for 'size' in a dictionary
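
To make the two-level header concrete, here is a minimal sketch. It assumes a small made-up frame in place of the tips dataset (the values are illustrative, not taken from the original data), and the flattening step at the end is just one common option, not something the statement above requires.

```python
import pandas as pd

# Small made-up frame standing in for the tips data (values are illustrative).
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'size': [2, 4, 2, 3, 4],
})

result = df.groupby('sex').agg(['mean', 'count'])

# Columns are a two-level MultiIndex: ('size', 'mean') and ('size', 'count').
print(result.columns.tolist())

# Select a single aggregation with a tuple key.
print(result[('size', 'mean')])

# Optionally flatten the header into single strings such as 'size_mean'.
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())
```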
df_tips[['sex','size']].groupby(['sex']).agg({'size':['mean','count']})

Output (same as above):

            size      
            mean count
sex                   
Male    2.630573   157
Female  2.459770    87

Statement 3: Using named aggregations
df_tips[['sex','size']].groupby(['sex']).agg(mean_size=('size','mean'),count_size=('size','count'))

Output:

        mean_size  count_size
sex                          
Male     2.630573         157
Female   2.459770          87

This gives a DataFrame with a flat column header that you name yourself. Because the names are passed as keyword arguments, written this way they must be valid Python identifiers, so no spaces or special characters.

The incorrect way is your second method
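
If a column name with a space is really needed, one workaround is to build the named aggregations as a dictionary and unpack it into agg. A minimal sketch, assuming a small made-up frame (illustrative values) and the hypothetical names 'mean size' and 'count size':

```python
import pandas as pd

# Small made-up frame standing in for the tips data (values are illustrative).
df = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'size': [2, 4, 2, 3, 4],
})

# Names with spaces cannot be written as keywords, but a dict can be unpacked.
# Each key is distinct, so nothing is overwritten here.
named = {
    'mean size': ('size', 'mean'),
    'count size': ('size', 'count'),
}
result = df.groupby('sex').agg(**named)
print(result)
```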
df_tips[['sex','size']].groupby(['sex']).agg({'size':'mean','size':'count'})

Output:

        size
sex         
Male     157
Female    87

What is happening here is that the dictionary contains the key 'size' twice. A Python dict keeps only one value per key, so the second entry ('count') overwrites the first ('mean') before agg is even called, and pandas only ever sees {'size': 'count'}.
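
The overwriting can be checked without pandas at all. A minimal sketch in plain Python (the variable name spec is just for illustration):

```python
# A dict literal with a repeated key keeps only the last value,
# so agg would receive a single aggregation for 'size'.
spec = {'size': 'mean', 'size': 'count'}

print(spec)       # {'size': 'count'}
print(len(spec))  # 1
```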
