Understanding pandas groupby().agg() values
I came across some code for testing Simpson's Paradox, and I'm confused about how it works.
The data is in this form:
and when I run
gb = df_.groupby(["kidney_stone_size", "treatment"]).agg([np.sum, lambda x: len(x)])
gb
I get
I can't fully understand what df_.groupby(["kidney_stone_size", "treatment"]).agg([np.sum, lambda x: len(x)]) does.
For one, the aggregate data appears to be calculated from the columns omitted from the groupby part, as when I do
gb = df_.groupby(["recovery", "treatment"]).agg([np.sum, lambda x: len(x)])
I get
So is that the default behaviour - that the aggregate data is calculated for the missing columns?
I know you can specify columns explicitly in a dictionary, but I'm trying to understand the code as is.
What exactly is being calculated by the .agg([np.sum, lambda x: len(x)])?
I.e. what exactly is np.sum being applied to, and likewise lambda x: len(x)?
Please understand that there may be some conceptual gaps in my understanding that might make what is obvious from the outside non-obvious to me. Any help much appreciated.
So is that the default behaviour - that the aggregate data is calculated for the missing columns?
I think yes: if you do not specify which columns to process after groupby, pandas takes all columns not used in groupby and applies the aggregate functions to each of them.
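A minimal sketch of that default, using a small made-up frame (the data and the names keys, all_cols and only_recovery are illustrative assumptions, not the asker's dataset). It contrasts aggregating with no column selection against selecting one column first:

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df_ = pd.DataFrame({
    "kidney_stone_size": list("aaaaaa"),
    "recovery": [4, 5, 4, 5, 5, 4],
    "col1": [1, 3, 5, 7, 1, 0],
    "treatment": list("aaabbb"),
})

keys = ["kidney_stone_size", "treatment"]

# No column selection: every column that is not a grouping key is aggregated
all_cols = df_.groupby(keys).agg(["sum", "count"])

# Selecting a column first restricts the aggregation to that column
only_recovery = df_.groupby(keys)["recovery"].agg(["sum", "count"])

print(all_cols)
print(only_recovery)
```

In the first result the top level of the column index lists the non-key columns (recovery and col1); in the second only the function names remain as columns.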
What exactly is being calculated by the .agg([np.sum, lambda x: len(x)])?
Here sum works like join for non-numeric columns (the strings are concatenated) and returns the ordinary sum for numeric columns. Your custom function lambda x: len(x) returns the length of each group, for numeric and non-numeric columns alike.
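import numpy as np
import pandas as pd

# Small sample frame with two numeric columns and one string column
df_ = pd.DataFrame({
    'kidney_stone_size': list('aaaaaa'),
    'recovery': [4, 5, 4, 5, 5, 4],
    'col1': [1, 3, 5, 7, 1, 0],
    'col2': ['new'] * 6,
    'treatment': list('aaabbb')
})
# Apply both functions to every column that is not a grouping key
df = df_.groupby(["kidney_stone_size", "treatment"]).agg([np.sum, lambda x: len(x)])
print(df)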
                            recovery            col1                  col2  \
                                 sum <lambda_0>  sum <lambda_0>        sum
kidney_stone_size treatment
a                 a               13          3    9          3  newnewnew
                  b               14          3    8          3  newnewnew

                            <lambda_0>
kidney_stone_size treatment
a                 a                  3
                  b                  3
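To see what each function actually receives, the same numbers can be reproduced by hand. This is only a sketch under the assumption that the sample frame above is used; the names manual and agg_result are made up for illustration. Each function is handed one non-key column of one group at a time, as a Series:

```python
import pandas as pd

df_ = pd.DataFrame({
    "kidney_stone_size": list("aaaaaa"),
    "recovery": [4, 5, 4, 5, 5, 4],
    "col1": [1, 3, 5, 7, 1, 0],
    "col2": ["new"] * 6,
    "treatment": list("aaabbb"),
})

keys = ["kidney_stone_size", "treatment"]

# Walk over the groups and apply sum and len to each non-key column
manual = {}
for key, group in df_.groupby(keys):
    for col in ["recovery", "col1", "col2"]:
        series = group[col]  # one column of one group
        manual[(key, col)] = (series.sum(), len(series))

# The built-in aggregation does the same work in one call
agg_result = df_.groupby(keys).agg(["sum", lambda x: len(x)])

for (key, col), values in manual.items():
    print(key, col, values)
```

For the string column, sum concatenates the values within the group, which is why it behaves like a join there.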
But if you call only an aggregate function meant for numeric data, such as sum, pandas by default omits the non-numeric columns (this is the behaviour before pandas 2.0; from 2.0 on, pass numeric_only=True to get the same result):
df = df_.groupby(["kidney_stone_size", "treatment"]).sum()
print (df)
                             recovery  col1
kidney_stone_size treatment
a                 a                13     9
                  b                14     8
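Because the default handling of non-numeric columns in a bare .sum() differs between pandas versions, a version-independent way to get the numeric-only result is to state the intent explicitly. A small sketch, again assuming the sample frame from this answer (numeric_sum is an illustrative name):

```python
import pandas as pd

df_ = pd.DataFrame({
    "kidney_stone_size": list("aaaaaa"),
    "recovery": [4, 5, 4, 5, 5, 4],
    "col1": [1, 3, 5, 7, 1, 0],
    "col2": ["new"] * 6,
    "treatment": list("aaabbb"),
})

# numeric_only=True drops the string column col2 regardless of the pandas default
numeric_sum = df_.groupby(["kidney_stone_size", "treatment"]).sum(numeric_only=True)
print(numeric_sum)
```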