
pandas gives different numerical results when using apply and agg for a GroupBy object

I found that if we pass np.var to apply, it calculates the population variance, but if we pass np.var to agg, it calculates the sample variance, as the following example demonstrates:

import numpy as np
import pandas as pd

np.random.seed(1)  # fixed seed so the 'weights' column is reproducible
df = pd.DataFrame({'category': list("a"*4+"b"*4), 'data': np.arange(8), 'weights': np.random.rand(8)})
df
#   category  data   weights
# 0        a     0  0.417022
# 1        a     1  0.720324
# 2        a     2  0.000114
# 3        a     3  0.302333
# 4        b     4  0.146756
# 5        b     5  0.092339
# 6        b     6  0.186260
# 7        b     7  0.345561

print(df.groupby('category').apply(np.var) ) # population variance
#           data   weights
# category                
# a         1.25  0.066482
# b         1.25  0.008898
print(df.groupby('category').agg(np.var) ) # sample variance
#               data   weights
# category                    
# a         1.666667  0.088643
# b         1.666667  0.011864

Can anyone tell me why np.var does not give consistent results? Thanks a lot!

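You can use ddof to make the answers consistent. np.var defaults to ddof=0 (population variance, dividing by n), whereas pandas' own var defaults to ddof=1 (sample variance, dividing by n - 1), which is the value agg returns in the question.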

print(df.groupby('category').apply(np.var))  # population variance
#           data   weights
# category                
# a         1.25  0.066482
# b         1.25  0.008898

print(df.groupby('category').agg(lambda x: np.var(x, ddof=0)))  # population variance
#           data   weights
# category                
# a         1.25  0.066482
# b         1.25  0.008898

print(df.groupby('category').agg(np.var))  # sample variance
#               data   weights
# category                    
# a         1.666667  0.088643
# b         1.666667  0.011864

print(df.groupby('category').apply(lambda x: np.var(x, ddof=1)))  # sample variance
#               data   weights
# category                    
# a         1.666667  0.088643
# b         1.666667  0.011864

Read more about it in the np.var documentation.
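To see what ddof changes, here is a minimal sketch using only NumPy on the four 'data' values of category "a" from the question. The variable names (x, pop_var, sample_var) are just illustrative choices, not anything from the original post.

import numpy as np

# Same values as the 'data' column for category "a" in the question.
x = np.array([0, 1, 2, 3])
n = len(x)

pop_var = np.var(x)             # ddof=0 is NumPy's default: divide by n
sample_var = np.var(x, ddof=1)  # divide by n - 1 instead

print(pop_var)     # 1.25
print(sample_var)  # about 1.666667

# The two estimates differ only by the factor n / (n - 1).
assert np.isclose(sample_var, pop_var * n / (n - 1))

With only four rows per group the factor n / (n - 1) is 4/3, which is why 1.25 becomes 1.666667 in the tables above.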

================= ==================

You can also directly use the .var() method of the GroupBy object, which computes the sample variance by default:

df.groupby('category').var()

#               data   weights
# category                    
# a         1.666667  0.088643
# b         1.666667  0.011864
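If the population variance is what you want, the GroupBy .var() method also accepts a ddof argument. The following is a small sketch under that assumption, using a cut-down frame with only the 'data' column. The names grouped, sample, population and manual are illustrative.

import numpy as np
import pandas as pd

# Reduced version of the question's frame: two groups of four values each.
df = pd.DataFrame({'category': list("a"*4 + "b"*4), 'data': np.arange(8)})
grouped = df.groupby('category')['data']

sample = grouped.var()            # pandas default, ddof=1
population = grouped.var(ddof=0)  # same divisor as np.var's default

# Cross-check against an explicit NumPy call on each group's raw values.
manual = grouped.agg(lambda s: np.var(s.to_numpy(), ddof=0))

print(sample)      # 1.666667 for both groups
print(population)  # 1.25 for both groups

Calling .var(ddof=...) directly states the divisor explicitly, so the result does not depend on how apply or agg treats np.var.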
