I found that if we pass np.var to apply, it calculates the population variance, but if we pass np.var to agg, it calculates the sample variance, as the following example demonstrates:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'category': list("a"*4+"b"*4), 'data': np.arange(8), 'weights': np.random.rand(8)})
df
#   category  data   weights
# 0        a     0  0.417022
# 1        a     1  0.720324
# 2        a     2  0.000114
# 3        a     3  0.302333
# 4        b     4  0.146756
# 5        b     5  0.092339
# 6        b     6  0.186260
# 7        b     7  0.345561
print(df.groupby('category').apply(np.var))  # population variance
#           data   weights
# category
# a         1.25  0.066482
# b         1.25  0.008898
print(df.groupby('category').agg(np.var))  # sample variance
#               data   weights
# category
# a         1.666667  0.088643
# b         1.666667  0.011864
Can anyone please tell me why np.var will not give consistent results? Thanks a lot!
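np.var defaults to ddof=0 (population variance), whereas pandas' own var defaults to ddof=1 (sample variance), which is the result agg returns here. You can pass ddof explicitly to make the answers consistent: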
print(df.groupby('category').apply(np.var))  # population variance
#           data   weights
# category
# a         1.25  0.066482
# b         1.25  0.008898
print(df.groupby('category').agg(lambda x: np.var(x, ddof=0)))  # population variance
#           data   weights
# category
# a         1.25  0.066482
# b         1.25  0.008898
print(df.groupby('category').agg(np.var))  # sample variance
#               data   weights
# category
# a         1.666667  0.088643
# b         1.666667  0.011864
print(df.groupby('category').apply(lambda x: np.var(x, ddof=1)))  # sample variance
#               data   weights
# category
# a         1.666667  0.088643
# b         1.666667  0.011864
Read more about it in the np.var documentation.
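To make the role of ddof concrete, here is a minimal sketch that computes both variances by hand for the values 0 to 3 (the same values as group "a" in the 'data' column) and compares them with np.var. The variable names are illustrative and not from the original post.

```python
import numpy as np

x = np.arange(4)  # same values as group "a" in the question's 'data' column

# Variance by hand: sum of squared deviations divided by (N - ddof).
sq_dev = ((x - x.mean()) ** 2).sum()
population_var = sq_dev / len(x)        # ddof=0, NumPy's default
sample_var = sq_dev / (len(x) - 1)      # ddof=1, pandas' default

print(population_var, np.var(x))         # both 1.25
print(sample_var, np.var(x, ddof=1))     # both about 1.666667
```

The only difference between the two results is the divisor, N versus N - 1, which is why the gap is noticeable for groups of four rows.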
=================
You can also call .var() on the groupby object directly:
df.groupby('category').var()
#               data   weights
# category
# a         1.666667  0.088643
# b         1.666667  0.011864
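The groupby var method also accepts a ddof argument, so either variance can be requested without going through np.var at all. The sketch below rebuilds the question's DataFrame and, as an assumption made here for portability across pandas versions, selects the numeric columns explicitly; the names grouped, sample_var and population_var are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'category': list("a"*4 + "b"*4),
                   'data': np.arange(8),
                   'weights': np.random.rand(8)})

# Select the numeric columns explicitly so the grouping key is not aggregated.
grouped = df.groupby('category')[['data', 'weights']]

sample_var = grouped.var()            # ddof=1 by default (sample variance)
population_var = grouped.var(ddof=0)  # same divisor as np.var's default

print(sample_var)
print(population_var)
```

Stating ddof explicitly in this way keeps the result independent of whichever function apply or agg ends up calling.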