pandas gives different numerical results when using apply and agg on a GroupBy object

Question

I found that passing np.var to apply computes the population variance, while passing np.var to agg computes the sample variance, as the following example demonstrates:

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'category': list("a"*4+"b"*4), 'data': np.arange(8), 'weights': np.random.rand(8)})
df
#   category  data   weights
# 0        a     0  0.417022
# 1        a     1  0.720324
# 2        a     2  0.000114
# 3        a     3  0.302333
# 4        b     4  0.146756
# 5        b     5  0.092339
# 6        b     6  0.186260
# 7        b     7  0.345561

print(df.groupby('category').apply(np.var))  # population variance
#           data   weights
# category                
# a         1.25  0.066482
# b         1.25  0.008898
print(df.groupby('category').agg(np.var))  # sample variance
#               data   weights
# category                    
# a         1.666667  0.088643
# b         1.666667  0.011864

Can anyone please tell me why np.var does not give consistent results? Thanks a lot!

Answer 1

You can pass ddof explicitly to make the results consistent (ddof=0 gives the population variance, ddof=1 the sample variance):

print(df.groupby('category').apply(np.var))  # population variance
#           data   weights
# category
# a         1.25  0.066482
# b         1.25  0.008898

print(df.groupby('category').agg(lambda x: np.var(x, ddof=0)))  # population variance
#           data   weights
# category
# a         1.25  0.066482
# b         1.25  0.008898

print(df.groupby('category').agg(np.var))  # sample variance
#               data   weights
# category
# a         1.666667  0.088643
# b         1.666667  0.011864

print(df.groupby('category').apply(lambda x: np.var(x, ddof=1)))  # sample variance
#               data   weights
# category
# a         1.666667  0.088643
# b         1.666667  0.011864

Read more about it in the np.var documentation.
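To see the effect of ddof in isolation, here is a minimal sketch using the four 'data' values of category a. It assumes only the documented defaults: np.var uses ddof=0 unless told otherwise, while the pandas .var() method uses ddof=1. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([0, 1, 2, 3])  # the 'data' values of category "a"

# NumPy default: ddof=0, i.e. divide by N (population variance)
pop_var = np.var(values)              # 5 / 4 = 1.25

# ddof=1: divide by N - 1 (sample variance)
sample_var = np.var(values, ddof=1)   # 5 / 3, about 1.666667

# pandas default: ddof=1, so this matches the sample variance
pandas_var = pd.Series(values).var()

print(pop_var, sample_var, pandas_var)
```

The two numbers are the same 1.25 and 1.666667 that appear in the 'data' column of the outputs above.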

================= ==================

You can also use the .var() method of the groupby object directly:

df.groupby('category').var()
#               data   weights
# category
# a         1.666667  0.088643
# b         1.666667  0.011864
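The groupby .var() method also accepts a ddof argument, so the population variance is available without going through np.var at all. A minimal sketch, assuming a reduced version of the question's frame with only the 'data' column (the names grouped, pop_var and sample_var are illustrative):

```python
import numpy as np
import pandas as pd

# Reduced version of the frame from the question: two groups of four values
df = pd.DataFrame({'category': list("a" * 4 + "b" * 4), 'data': np.arange(8)})

# Select the numeric column explicitly so only 'data' is aggregated
grouped = df.groupby('category')['data']

pop_var = grouped.var(ddof=0)   # population variance per group
sample_var = grouped.var()      # sample variance per group (ddof=1 default)

print(pop_var)
print(sample_var)
```

Setting ddof explicitly in either call makes the intent visible and avoids depending on which default is in effect.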
