pandas gives different numerical results when using apply and agg for a GroupBy object
I found that if we pass np.var to apply, it calculates the population variance, but if we pass np.var to agg, it calculates the sample variance, as the following example demonstrates:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'category': list("a"*4+"b"*4), 'data': np.arange(8), 'weights': np.random.rand(8)})
df
# category data weights
# 0 a 0 0.417022
# 1 a 1 0.720324
# 2 a 2 0.000114
# 3 a 3 0.302333
# 4 b 4 0.146756
# 5 b 5 0.092339
# 6 b 6 0.186260
# 7 b 7 0.345561
print(df.groupby('category').apply(np.var) ) # population variance
# data weights
# category
# a 1.25 0.066482
# b 1.25 0.008898
print(df.groupby('category').agg(np.var) ) # sample variance
# data weights
# category
# a 1.666667 0.088643
# b 1.666667 0.011864
Can anyone tell me why np.var does not give consistent results? Thanks a lot!
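np.var defaults to ddof=0 (population variance, dividing by N), while pandas' own var defaults to ddof=1 (sample variance, dividing by N - 1). In the pandas versions where this discrepancy shows up, agg(np.var) appears to be handed off to the pandas implementation, whereas apply(np.var) uses NumPy's default. You can pass ddof explicitly to make the answers consistent: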
print(df.groupby('category').apply(np.var))  # population variance
# data weights
# category
# a 1.25 0.066482
# b 1.25 0.008898
print(df.groupby('category').agg(lambda x: np.var(x, ddof=0)))  # population variance
# data weights
# category
# a 1.25 0.066482
# b 1.25 0.008898
print(df.groupby('category').agg(np.var))  # sample variance
# data weights
# category
# a 1.666667 0.088643
# b 1.666667 0.011864
print(df.groupby('category').apply(lambda x: np.var(x, ddof=1)))  # sample variance
# data weights
# category
# a 1.666667 0.088643
# b 1.666667 0.011864
Read more about it in the np.var documentation.
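To make the role of ddof concrete, here is a minimal sketch that computes both variances by hand for the 'data' values of category "a" and compares them with np.var. The variable names (values, squared_dev and so on) are chosen for illustration only and do not come from the original post.

```python
import numpy as np

# The 'data' column of category "a" from the example above.
values = np.array([0.0, 1.0, 2.0, 3.0])

# Sum of squared deviations from the mean.
squared_dev = (values - values.mean()) ** 2
n = len(values)

# ddof ("delta degrees of freedom") sets the divisor to N - ddof.
population_var = squared_dev.sum() / n        # ddof=0, NumPy's default
sample_var = squared_dev.sum() / (n - 1)      # ddof=1, pandas' default

print(population_var, np.var(values, ddof=0))
print(sample_var, np.var(values, ddof=1))
```

The two hand-computed values correspond to the 1.25 and 1.666667 shown for the 'data' column in the outputs above.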
================= ==================
You can also call .var() on the groupby object directly; it computes the sample variance (ddof=1) by default:
df.groupby('category').var()
# data weights
# category
# a 1.666667 0.088643
# b 1.666667 0.011864
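If you need the population variance instead, the groupby .var() method also accepts a ddof argument. The following is a small sketch under the assumption that only the 'data' column matters; the names grouped, sample, population and via_lambda are illustrative. It avoids passing np.var itself to apply or agg, since that behavior has differed between pandas versions.

```python
import numpy as np
import pandas as pd

# Same 'category' and 'data' columns as in the question.
df = pd.DataFrame({'category': list("a" * 4 + "b" * 4), 'data': np.arange(8)})
grouped = df.groupby('category')['data']

sample = grouped.var()              # ddof=1 by default: sample variance
population = grouped.var(ddof=0)    # explicit ddof=0: population variance

# Same population variance via an explicit lambda on the raw NumPy values.
via_lambda = grouped.agg(lambda s: np.var(s.to_numpy(), ddof=0))

print(sample)
print(population)
print(via_lambda)
```

Stating ddof explicitly in every call keeps the result independent of which library's default ends up being used.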