
Group by two columns and cumulative sum with a 6-month lookback window on date

Original dataset

userId     createDate                  grade
0          2016-05-08 22:00:49.673     2
0          2016-07-23 12:37:11.570     7
0          2017-01-03 12:05:33.060     7
1009       2016-06-27 09:28:19.677     5
1009       2016-07-23 12:37:11.570     8
1009       2017-01-03 12:05:33.060     9
1009       2017-02-08 16:17:17.547     4
2011       2016-11-03 14:30:25.390     6
2011       2016-12-15 21:06:14.730     11
2011       2017-01-04 20:22:31.423     2
2011       2017-08-08 16:17:17.547     7

I want a rolling sum of grade for each user with a 6-month lookback window from createDate, i.e. for each row, the sum of all of that user's grades within the 6 months up to and including its createDate. Expected:

userId     createDate                  grade
    0          2016-05-08 22:00:49.673     2
               2016-07-23 12:37:11.570     9
               2017-01-03 12:05:33.060     14
    1009       2016-06-27 09:28:19.677     5
               2016-07-23 12:37:11.570     13
               2017-01-03 12:05:33.060     17
               2017-02-08 16:17:17.547     13
    2011       2016-11-03 14:30:25.390     6
               2016-12-15 21:06:14.730     17
               2017-01-04 20:22:31.423     19
               2017-08-08 16:17:17.547     7

My current attempt is incorrect:

df.groupby(['userId','createDate'])['grade'].mean().groupby([pd.Grouper(level='userId'),pd.TimeGrouper('6M', level='createDate', closed = 'left')]).cumsum()

It gives me following result:

userId  createDate             
0       2016-05-08 22:00:49.673     2
        2016-07-23 12:37:11.570     9
        2017-01-03 12:05:33.060     7
1009    2016-06-27 09:28:19.677     5
        2016-07-23 12:37:11.570    13
        2017-01-03 12:05:33.060     9
        2017-02-08 16:17:17.547    13
2011    2016-11-03 14:30:25.390     6
        2016-12-15 21:06:14.730    17
        2017-01-04 20:22:31.423    19
        2017-08-08 16:17:17.547     7

Use groupby and a rolling sum inside apply with an offset of 180D rather than 6 months, because the number of days per month varies from month to month and a rolling window must be constant, i.e.

df.groupby(['userId'])[['createDate','grade']].apply(lambda x : x.set_index('createDate').rolling('180D').sum())

                                grade
userId createDate                    
0      2016-05-08 22:00:49.673    2.0
       2016-07-23 12:37:11.570    9.0
       2017-01-03 12:05:33.060   14.0
1009   2016-06-27 09:28:19.677    5.0
       2016-07-23 12:37:11.570   13.0
       2017-01-03 12:05:33.060   17.0
       2017-02-08 16:17:17.547   13.0
2011   2016-11-03 14:30:25.390    6.0
       2016-12-15 21:06:14.730   17.0
       2017-01-04 20:22:31.423   19.0
       2017-08-08 16:17:17.547    7.0

Edit for comment:

To look back 6 months, the dates need to be sorted, so you may need sort_values:

 df.groupby(['userId'])[['createDate','grade']].apply(lambda x : \
            x.sort_values('createDate').set_index('createDate').rolling('180D').sum())

Edit based on @coldspeed's comment:

Using apply is overkill; set the index first, then use a rolling sum:

df.set_index('createDate').groupby('userId').grade.rolling('180D').sum()
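This version also relies on the dates being sorted within each user. A sketch, using a hypothetical three-row frame, that sorts first and flattens the `(userId, createDate)` result back into columns with `reset_index`:

```python
import pandas as pd

# Hypothetical one-user frame to show the pipeline shape.
df = pd.DataFrame({
    'userId': [0, 0, 0],
    'createDate': pd.to_datetime(['2016-05-08', '2016-07-23', '2017-01-03']),
    'grade': [2, 7, 7],
})

out = (df.sort_values('createDate')    # time-based rolling needs a sorted index
         .set_index('createDate')
         .groupby('userId')['grade']
         .rolling('180D').sum()
         .reset_index(name='rolling_grade'))  # back to a flat frame
print(out)
```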

Timings:

df = pd.concat([df]*1000)

%%timeit
df.set_index('createDate').groupby('userId').grade.rolling('180D').sum() 
100 loops, best of 3: 7.55 ms per loop

%%timeit
df.groupby(['userId'])[['createDate','grade']].apply(lambda x : x.sort_values('createDate').set_index('createDate').rolling('180D').sum())
10 loops, best of 3: 19.5 ms per loop
