
Pandas returns incorrect groupby rolling sum of zeros for float64 when having many groups

When doing a groupby rolling sum in pandas with dtype float64, sums over all-zero windows become arbitrary small floats when the number of groups is large. For example,

import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'a': (np.random.random(800)*1e5+1e5).tolist() + [0.0]*800, 'b': list(range(80))*20})
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})

The first line generates a dataframe with two columns, a and b.

  • Column a has 800 random numbers between 1e5 and 2e5 and 800 zeros.
  • Column b assigns these to 80 groups.

For example, group 79 of the df looks like this:

                  a   b
79    158742.001924  79
159   115045.502837  79
239   171582.695286  79
319   181072.123361  79
399   194672.826961  79
479   130100.794308  79
559   169784.165605  79
639   132752.405585  79
719   162355.180105  79
799   148140.045915  79
879        0.000000  79
959        0.000000  79
1039       0.000000  79
1119       0.000000  79
1199       0.000000  79
1279       0.000000  79
1359       0.000000  79
1439       0.000000  79
1519       0.000000  79
1599       0.000000  79

The second line calculates the rolling sum with window 5 for column a within each group.

One would expect the rolling sum to be zero for the last few entries in each group. However, arbitrarily small floats are returned instead, e.g. -5.820766e-11 for group 79 below:

                 a
79    1.587420e+05
159   2.737875e+05
239   4.453702e+05
319   6.264423e+05
399   8.211152e+05
479   7.924739e+05
559   8.472126e+05
639   8.083823e+05
719   7.896654e+05
799   7.431326e+05
879   6.130318e+05
959   4.432476e+05
1039  3.104952e+05
1119  1.481400e+05
1199 -5.820766e-11
1279 -5.820766e-11
1359 -5.820766e-11
1439 -5.820766e-11
1519 -5.820766e-11
1599 -5.820766e-11

If we decrease the number of groups to 20, the issue disappears, e.g.

df['b'] = list(range(20))*80
a = df.groupby('b').rolling(5, min_periods=1).agg({'a': 'sum'})

This yields (for group 19, since there are now only 20 groups, numbered 0-19):

                  a
19    165083.125668
39    359750.793592
59    485563.758520
79    644305.760443
99    837370.199660
             ...
1519       0.000000
1539       0.000000
1559       0.000000
1579       0.000000
1599       0.000000
[80 rows x 1 columns]

This was only tested on pandas 1.2.5 / Python 3.7.9 / Windows 10. You might have to increase the number of groups for this to show up on your machine.

In my application, I can't really control the number of groups. If I change the dtype to float32, the issue goes away, but this causes me to lose precision for large numbers.

Any idea what's causing this and how to resolve it besides using float32?

TL;DR: this is a side effect of an optimization; the workaround is to use a non-pandas sum.

The reason is that pandas optimizes the rolling sum. A naive rolling-window function takes O(n*w) time for n rows and window size w. However, since the function is known to be a sum, pandas can keep a running total: subtract the element going out of the window and add the one coming in. This approach no longer depends on the window size and is always O(n).
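The running-total idea can be sketched in pure Python (an illustrative sketch of the technique, not pandas' actual implementation):

```python
def rolling_sum(values, window, min_periods=1):
    """O(n) sliding-window sum: add the entering element, subtract the leaving one."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v                       # element entering the window
        if i >= window:
            total -= values[i - window]  # element leaving the window
        n_in_window = min(i + 1, window)
        out.append(total if n_in_window >= min_periods else float('nan'))
    return out

print(rolling_sum([1.0, 2.0, 3.0, 4.0], window=2))  # [1.0, 3.0, 5.0, 7.0]
```

Each step does one addition and at most one subtraction, so the cost per row is constant regardless of the window size.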

The caveat is that we now get floating-point precision side effects, manifesting exactly as you describe: once a nonzero value has passed through the running total, subtracting it back out need not return the total to exactly zero.
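A minimal demonstration of that side effect, with no pandas involved: adding values to a float64 accumulator and then subtracting the same values does not necessarily land back on exactly 0.0.

```python
# Simulate a window's worth of values passing through a running total.
s = 0.0
s += 0.1   # 0.1 enters the window
s += 0.2   # 0.2 enters the window
s -= 0.2   # 0.2 leaves the window
s -= 0.1   # 0.1 leaves the window
print(s)   # a tiny nonzero residue, not exactly 0.0
```

This is the same mechanism that produces the -5.820766e-11 values in the question: the all-zero windows inherit rounding residue left over from the large values that previously passed through the accumulator.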

Sources: Python code calling window aggregation, Cython implementation of the rolling sum
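One concrete form of the "non-pandas sum" workaround is to force a naive per-window computation via Rolling.apply with np.sum as the reducer (raw=True passes each window as a plain ndarray). This recomputes every window from scratch, so all-zero windows sum to exactly 0.0, at the cost of the O(n*w) runtime:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'a': (np.random.random(800) * 1e5 + 1e5).tolist() + [0.0] * 800,
                   'b': list(range(80)) * 20})

# np.sum recomputes each window independently, avoiding the running-total residue.
exact = df.groupby('b').rolling(5, min_periods=1)['a'].apply(np.sum, raw=True)
print(exact.tail(6))  # all-zero windows now sum to exactly 0.0
```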
