pandas gives different numerical results when using apply and agg for a GroupBy object

Question

I found that if we pass np.var to apply, it calculates the population variance, but if we pass np.var to agg, it calculates the sample variance, as the following example demonstrates:

import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'category': list("a"*4+"b"*4), 'data': np.arange(8), 'weights': np.random.rand(8)})
df
#   category  data   weights
# 0        a     0  0.417022
# 1        a     1  0.720324
# 2        a     2  0.000114
# 3        a     3  0.302333
# 4        b     4  0.146756
# 5        b     5  0.092339
# 6        b     6  0.186260
# 7        b     7  0.345561

print(df.groupby('category').apply(np.var) ) # population variance
#           data   weights
# category                
# a         1.25  0.066482
# b         1.25  0.008898
print(df.groupby('category').agg(np.var) ) # sample variance
#               data   weights
# category                    
# a         1.666667  0.088643
# b         1.666667  0.011864

Can anyone please tell me why np.var will not give consistent results? Thanks a lot!

Answer 1

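The difference comes from the default value of ddof (delta degrees of freedom): np.var uses ddof=0 (population variance), while pandas' own var uses ddof=1 (sample variance), which is what agg(np.var) returns here. Passing ddof explicitly makes the results consistent: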

print(df.groupby('category').apply(np.var) ) # population variance
          data   weights
category                
a         1.25  0.066482
b         1.25  0.008898

print(df.groupby('category').agg(lambda x: np.var(x, ddof=0)) )  # population variance
          data   weights
category                
a         1.25  0.066482
b         1.25  0.008898


print(df.groupby('category').agg(np.var) ) # sample variance
              data   weights
category                    
a         1.666667  0.088643
b         1.666667  0.011864

print(df.groupby('category').apply(lambda x: np.var(x, ddof=1)) ) # sample variance
              data   weights
category                    
a         1.666667  0.088643
b         1.666667  0.011864

Read more about it in the np.var documentation.
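The two defaults can be checked on a single group without any groupby involved. This is a minimal sketch using the values 0 to 3 from category a in the question; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# The 'data' values of category 'a' from the question.
values = np.array([0, 1, 2, 3])

# NumPy divides by N by default (ddof=0): population variance.
numpy_default = np.var(values)            # 1.25
# With ddof=1 NumPy divides by N - 1: sample variance.
numpy_sample = np.var(values, ddof=1)     # about 1.6667

# pandas divides by N - 1 by default (ddof=1): sample variance.
pandas_default = pd.Series(values).var()            # about 1.6667
# With ddof=0 pandas matches NumPy's default.
pandas_population = pd.Series(values).var(ddof=0)   # 1.25

print(numpy_default, numpy_sample, pandas_default, pandas_population)
```

The sum of squared deviations from the mean is 5 here, so dividing by 4 gives 1.25 and dividing by 3 gives 1.666667, which matches the two tables above.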

=================

You can also use the .var() method of the groupby object directly (sample variance by default):

df.groupby('category').var()

              data   weights
category                    
a         1.666667  0.088643
b         1.666667  0.011864
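If the population variance is wanted from the built-in method instead, ddof can be passed to it as well. A minimal sketch, rebuilding only the data column from the question (the random weights column is left out so the numbers are deterministic):

```python
import numpy as np
import pandas as pd

# Same 'category' and 'data' columns as in the question.
df = pd.DataFrame({'category': list("a" * 4 + "b" * 4), 'data': np.arange(8)})

grouped = df.groupby('category')['data']

sample_var = grouped.var()            # default ddof=1: sample variance
population_var = grouped.var(ddof=0)  # ddof=0: population variance

print(sample_var)
print(population_var)
```

Being explicit about ddof avoids relying on whichever default a given call path happens to use.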
