Understanding pandas groupby().agg() values

Question

I came across some code for testing Simpson's Paradox, and I'm confused about how it works.

The data is in this form:

and when I run

gb = df_.groupby(["kidney_stone_size", "treatment"]).agg([np.sum, lambda x: len(x)])
gb

I get

I can't fully understand what df_.groupby(["kidney_stone_size", "treatment"]).agg([np.sum, lambda x: len(x)]) does.

For one, the aggregate data appears to be calculated from the columns omitted from the groupby part, because when I do

gb = df_.groupby(["recovery", "treatment"]).agg([np.sum, lambda x: len(x)])

I get

So is that the default behaviour: that the aggregate data is calculated for the columns missing from the groupby?

I know you can specify columns explicitly in a dictionary, but I'm trying to understand the code as is.

What exactly is being calculated by the .agg([np.sum, lambda x: len(x)])?

That is, what exactly is np.sum being applied to, and likewise lambda x: len(x)?

Please understand that there may be some conceptual gaps in my understanding that might make what is obvious from the outside non-obvious to me. Any help is much appreciated.

Answer 1

So is that the default behaviour - that the aggregate data is calculated for the missing columns?

I think yes. If you do not specify which columns to process after groupby, pandas uses all columns that were not used in the groupby and applies the aggregate functions to each of them.
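As a minimal sketch of that default, the snippet below compares aggregating with and without an explicit column selection. The toy data and the extra column name "age" are made up for illustration, and "count" stands in for the question's lambda x: len(x) (the two agree here only because the data has no missing values).

```python
import pandas as pd

# Hypothetical toy data; "age" is an invented extra column.
df = pd.DataFrame({
    "kidney_stone_size": ["small", "small", "small", "large", "large", "large"],
    "treatment": ["A", "A", "B", "A", "B", "B"],
    "recovery": [1, 0, 1, 1, 1, 0],
    "age": [30, 40, 50, 60, 70, 80],
})

keys = ["kidney_stone_size", "treatment"]

# No column selection: every column that is not a grouping key is aggregated.
all_cols = df.groupby(keys).agg(["sum", "count"])

# Explicit selection: only "recovery" is aggregated.
only_recovery = df.groupby(keys)["recovery"].agg(["sum", "count"])

print(all_cols)
print(only_recovery)
```

The first result has a two-level column index (original column, then function name), while the second has one column per function.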

What exactly is being calculated by the .agg([np.sum, lambda x: len(x)])?

Here, sum on non-numeric columns works like join (the strings are concatenated), and on numeric columns it returns the sum. Your custom function lambda x: len(x) returns the length of each group, for both numeric and non-numeric columns.

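import numpy as np
import pandas as pd

# Sample data: two grouping keys, two numeric columns and one string column
df_ = pd.DataFrame({
    'kidney_stone_size': list('aaaaaa'),
    'recovery': [4, 5, 4, 5, 5, 4],
    'col1': [1, 3, 5, 7, 1, 0],
    'col2': ['new'] * 6,
    'treatment': list('aaabbb')
})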

df = df_.groupby(["kidney_stone_size", "treatment"]).agg([np.sum, lambda x: len(x)])
print (df)
                            recovery            col1                  col2  \
                                 sum <lambda_0>  sum <lambda_0>        sum   
kidney_stone_size treatment                                                  
a                 a               13          3    9          3  newnewnew   
                  b               14          3    8          3  newnewnew   

                                        
                            <lambda_0>  
kidney_stone_size treatment             
a                 a                  3  
                  b                  3

But if you use only an aggregate function that works with numeric data, like sum, pandas by default omits non-numeric columns (this is the behaviour of the pandas version used here; newer versions may require numeric_only=True to get the same result):

df = df_.groupby(["kidney_stone_size", "treatment"]).sum()
print (df)
                             recovery  col1
kidney_stone_size treatment                
a                 a                13     9
                  b                14     8
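To make "what is each function applied to" concrete, here is a rough sketch that reproduces the aggregation by hand: for every group and every non-key column, the function receives that column's values within the group as a Series. The toy data and the names value_cols and manual are illustrative only, and this is not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data with a single value column.
df = pd.DataFrame({
    "kidney_stone_size": ["small", "small", "large"],
    "treatment": ["A", "A", "B"],
    "recovery": [1, 0, 1],
})

keys = ["kidney_stone_size", "treatment"]
value_cols = [c for c in df.columns if c not in keys]

# Manual version: one (sum, length) pair per (group, column).
manual = {}
for key, group in df.groupby(keys):
    for col in value_cols:
        series = group[col]  # the Series each aggregate function sees
        manual[(key, col)] = (series.sum(), len(series))

# The same computation expressed with agg.
agg = df.groupby(keys).agg(["sum", lambda x: len(x)])

print(manual)
print(agg)
```

In the agg result, the row index holds the group keys and the columns hold one (column, function) pair for each entry of the manual dictionary.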
