Right way to use groupby resample aggregate function

I have some data that I'm trying to group by "name" first and then resample by "transaction_date":

transaction_date    name    revenue
01/01/2020          ADIB    30419
01/01/2020          ADIB    1119372
01/01/2020          ADIB    1272170
01/01/2020          ADIB    43822
01/01/2020          ADIB    24199

The issue I have is that writing the groupby/resample in two different ways returns two different results:

1-- df.groupby("name").resample("M", on="transaction_date").sum()[['revenue']].head(12)

2-- df.groupby("name").resample("M", on="transaction_date").aggregate({'revenue':'sum'}).head(12)

The first method returns the values I'm looking for.

I don't understand why the two methods return different results. Is this a bug?

Result 1
name    transaction_date    revenue 
ADIB    2020-01-31          39170943.0
        2020-02-29          48003966.0
        2020-03-31          32691641.0
        2020-04-30          11979337.0
        2020-05-31          35510726.0
        2020-06-30          25677857.0
        2020-07-31          12437122.0
        2020-08-31          4348936.0
        2020-09-30          10547188.0
        2020-10-31          5287406.0
        2020-11-30          4288930.0
        2020-12-31          17066105.0

Result 2
name    transaction_date    revenue
ADIB    2020-01-31          64128331.0
        2020-02-29          54450014.0
        2020-03-31          45636192.0
        2020-04-30          25016777.0
        2020-05-31          11941744.0
        2020-06-30          15703151.0
        2020-07-31          5517526.0
        2020-08-31          4092618.0
        2020-09-30          4333433.0
        2020-10-31          3944117.0
        2020-11-30          6528058.0
        2020-12-31          5718196.0

Indeed, this is either a bug or extremely strange behavior. Consider the following data:

input: 

        date   revenue name
0 2020-10-27  0.744045  n_1
1 2020-10-29  0.074852  n_1
2 2020-11-21  0.560182  n_2
3 2020-12-29  0.208616  n_2
4 2020-05-03  0.325044  n_0

gb = df.groupby("name").resample("M", on="date")

gb.aggregate({'revenue':'sum'})

==>
              revenue
name date                
n_0  2020-12-31  0.325044
n_1  2020-05-31  0.744045
     2020-06-30  0.000000
     2020-07-31  0.000000
     2020-08-31  0.000000
     2020-09-30  0.000000
     2020-10-31  0.074852
n_2  2020-10-31  0.560182
     2020-11-30  0.208616


print(gb.sum()[['revenue']])
==>
                  revenue
name date                
n_0  2020-05-31  0.325044
n_1  2020-10-31  0.818897
n_2  2020-11-30  0.560182
     2020-12-31  0.208616

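As one can see, aggregate appears to produce the wrong results: the individual values show up, but attached to the wrong dates. For example, it takes the n_1 value from 2020-10-27 (0.744045) and attaches it to 2020-05-31, and the n_0 value from May (0.325044) ends up on 2020-12-31.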

Here's an even simpler example:

Data frame:

        date  revenue name
0 2020-02-24        9  n_1
1 2020-05-12        8  n_2
2 2020-03-28        9  n_2
3 2020-01-14        2  n_0

gb = df.groupby("name").resample("M", on="date")

res1 = gb.sum()[['revenue']]

==>
                 revenue
name date               
n_0  2020-01-31        2
n_1  2020-02-29        9
n_2  2020-03-31        9
     2020-04-30        0
     2020-05-31        8

res2 = gb.aggregate({'revenue':'sum'})

==>
                 revenue
name date               
n_0  2020-05-31        2
n_1  2020-01-31        9
n_2  2020-02-29        8
     2020-03-31        9

I opened a bug about it: https://github.com/pandas-dev/pandas/issues/35173
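
In the meantime, one way to sidestep groupby(...).resample(...) altogether is to group on the name and a monthly pd.Grouper in a single step. This is a sketch under assumptions, not a fix confirmed in the bug report: it rebuilds the five sample rows from the question, assumes the dates are day/month/year, and uses pd.offsets.MonthEnd() in place of the "M" alias.

```python
import pandas as pd

# The five sample rows shown in the question.
df = pd.DataFrame({
    "transaction_date": ["01/01/2020"] * 5,
    "name": ["ADIB"] * 5,
    "revenue": [30419, 1119372, 1272170, 43822, 24199],
})
# Assumption: dates are day/month/year; adjust the format if they are month/day/year.
df["transaction_date"] = pd.to_datetime(df["transaction_date"], format="%d/%m/%Y")

# Group by name and by month-end bins of transaction_date, then sum revenue.
month_end = pd.Grouper(key="transaction_date", freq=pd.offsets.MonthEnd())
monthly = df.groupby(["name", month_end])["revenue"].sum()

print(monthly)
```

The result is a Series indexed by (name, transaction_date), with each date being the month-end label of its bin, which is the same shape of index that the resample-based versions produce.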
