Pandas temporal cumulative sum by group

I have a data frame where one or more events are recorded for each id. For each event, the id, a metric x, and a date are recorded. Something like this:

import pandas as pd
import datetime as dt
import numpy as np

x = range(0, 6)
ids = ['a', 'a', 'b', 'a', 'b', 'b']
dates = [dt.datetime(2012, 5, 2), dt.datetime(2012, 4, 2), dt.datetime(2012, 6, 2),
         dt.datetime(2012, 7, 30), dt.datetime(2012, 4, 1), dt.datetime(2012, 5, 9)]

# Build the frame from a dict so each column keeps its own dtype;
# np.column_stack would coerce all three columns to object
df = pd.DataFrame({'id': ids, 'x': list(x), 'dates': dates})

I'd like to be able to set a lookback period (e.g. 70 days) and calculate, for each row in the dataset, the cumulative sum of x over all preceding events for that id that fall within the lookback period (excluding x for the row the calculation is being performed for). It should end up looking like this:

  id  x                dates    want
0  a  0  2012-05-02 00:00:00    1
1  a  1  2012-04-02 00:00:00    0
2  b  2  2012-06-02 00:00:00    9
3  a  3  2012-07-30 00:00:00    0
4  b  4  2012-04-01 00:00:00    0
5  b  5  2012-05-09 00:00:00    4
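
To make the definition concrete, here is a minimal brute-force sketch in plain Python (no pandas) that reproduces the want column above. The helper name lookback_sums is hypothetical, and the sketch assumes that an event exactly 70 days earlier still counts, which the question does not specify.

```python
import datetime as dt

# Same events as in the question, in the same row order: (id, x, date)
events = [
    ('a', 0, dt.datetime(2012, 5, 2)),
    ('a', 1, dt.datetime(2012, 4, 2)),
    ('b', 2, dt.datetime(2012, 6, 2)),
    ('a', 3, dt.datetime(2012, 7, 30)),
    ('b', 4, dt.datetime(2012, 4, 1)),
    ('b', 5, dt.datetime(2012, 5, 9)),
]


def lookback_sums(events, days=70):
    """For each event, sum x over strictly earlier events of the same id
    that lie at most `days` days before it (O(n^2) brute force).

    Assumption: the far boundary (exactly `days` days back) is inclusive.
    """
    window = dt.timedelta(days=days)
    sums = []
    for key, _, when in events:
        total = sum(
            x
            for other_key, x, other_when in events
            if other_key == key
            and dt.timedelta(0) < when - other_when <= window
        )
        sums.append(total)
    return sums


want = lookback_sums(events)
print(want)  # [1, 0, 9, 0, 0, 4]
```

This is quadratic in the number of events, so it is only useful as a reference to check a faster implementation against.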

Well, one approach is the following: (1) Do a groupby/apply with 'id' as the grouping variable. (2) Within the apply, resample the group to a daily time series. (3) Then use a rolling sum (and shift, so that the current row's 'x' value is not included) to compute the sum over the 70-day lookback period. (4) Reduce the group back to only the original observations:

In [12]: df = df.sort_values(['id','dates'])
In [13]: df
Out[13]: 
  id  x      dates
1  a  1 2012-04-02
0  a  0 2012-05-02
3  a  3 2012-07-30
4  b  4 2012-04-01
5  b  5 2012-05-09
2  b  2 2012-06-02

Your data needs to be sorted by ['id','dates']. Now we can do the groupby/apply:

In [15]: def past70(g):
             # keep only 'x' so the result does not carry the 'id' column
             g = g.set_index('dates')[['x']].resample('D').last()
             g['want'] = g['x'].rolling(70, min_periods=0).sum().shift(1)
             return g[g.x.notnull()]

In [16]: df = df.groupby('id').apply(past70)
In [17]: df
Out[17]: 
               x  want
id dates              
a  2012-04-02  1   NaN
   2012-05-02  0     1
   2012-07-30  3     0
b  2012-04-01  4   NaN
   2012-05-09  5     4
   2012-06-02  2     9

If you don't want the NaNs, then just do:

In [28]: df.fillna(0)
Out[28]: 
               x  want
id dates              
a  2012-04-02  1     0
   2012-05-02  0     1
   2012-07-30  3     0
b  2012-04-01  4     0
   2012-05-09  5     4
   2012-06-02  2     9

Edit: If you want to make the lookback window a parameter, do something like the following:

def past_window(g, win=70):
    # keep only 'x' so the result does not carry the 'id' column
    g = g.set_index('dates')[['x']].resample('D').last()
    g['want'] = g['x'].rolling(win, min_periods=0).sum().shift(1)
    return g[g.x.notnull()]

df = df.groupby('id').apply(past_window, win=10)
print(df.fillna(0))
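
For reference, the resample approach can be collected into one self-contained script. This is a minimal sketch assuming a reasonably recent pandas; the wrapper name lookback is hypothetical and not part of the original answer. One limitation worth knowing: resample('D').last() keeps a single value per day, so several events for the same id on the same day would collapse into one.

```python
import datetime as dt

import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'a', 'b', 'b'],
    'x': [0, 1, 2, 3, 4, 5],
    'dates': [dt.datetime(2012, 5, 2), dt.datetime(2012, 4, 2),
              dt.datetime(2012, 6, 2), dt.datetime(2012, 7, 30),
              dt.datetime(2012, 4, 1), dt.datetime(2012, 5, 9)],
})


def past_window(g, win=70):
    # One row per calendar day; days without an event hold NaN
    daily = g.set_index('dates')[['x']].resample('D').last()
    # Sum over the `win` days before each day; shift(1) drops the current day
    daily['want'] = daily['x'].rolling(win, min_periods=0).sum().shift(1)
    # Keep only the days that actually had an event
    return daily[daily['x'].notnull()]


def lookback(frame, win=70):
    """Hypothetical wrapper: sort, apply past_window per id, replace NaN with 0."""
    ordered = frame.sort_values(['id', 'dates'])
    # Selecting the columns explicitly keeps 'id' out of the applied frame
    out = ordered.groupby('id', group_keys=True)[['dates', 'x']].apply(
        past_window, win=win)
    return out.fillna(0).round().astype(int)


print(lookback(df))
print(lookback(df, win=10))
```

The result is indexed by (id, dates), as in the session output, so it is ordered by group rather than by the original row order.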

I needed to do something similar, so I looked around and found this page in the pandas cookbook (which I warmly recommend to anyone who wants to learn what this package can do): Pandas: rolling mean by time interval. In recent versions of pandas, rolling() accepts a time offset as the window, computed against a datetime-like column that you name with the on argument. So the example becomes more straightforward:

# First, convert the dates to datetime to make sure the column is compatible
df['dates'] = pd.to_datetime(df['dates'])

# Then, sort the time series so that it is monotonic within each id
df.sort_values(['id', 'dates'], inplace=True)

# '70D' is the time window we are considering
# closed='neither' excludes both interval bounds, so the current row
# (and a row exactly 70 days earlier) is left out of the sum
df['want'] = df.groupby('id').rolling('70D', on='dates',
                                      closed='neither')['x'].sum().to_numpy()

# Rows with no earlier event in the window come back as NaN; report them as 0
df['want'] = df['want'].fillna(0).astype(int)
df.sort_index() # to display it in the same order as the example provided
  id  x      dates  want
0  a  0 2012-05-02     1
1  a  1 2012-04-02     0
2  b  2 2012-06-02     9
3  a  3 2012-07-30     0
4  b  4 2012-04-01     0
5  b  5 2012-05-09     4
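
The closed argument decides which ends of the time window are included, which matters for an event that lies exactly 70 days back. The following is a small sketch on a toy series (made up for illustration, not taken from the question) that compares the four options.

```python
import pandas as pd

# Toy series: the second event falls exactly 70 days after the first,
# and the third event one day after the second
s = pd.Series(
    [1, 2, 4],
    index=pd.to_datetime(['2012-01-01', '2012-03-11', '2012-03-12']),
)

results = {}
for closed in ['right', 'both', 'left', 'neither']:
    rolled = s.rolling('70D', closed=closed).sum()
    # Empty windows give NaN; show them as 0, as the answer above does
    results[closed] = rolled.fillna(0).round().astype(int).tolist()
    print(closed, results[closed])

# right   [1, 2, 6]  window (t - 70 days, t]
# both    [1, 3, 6]  window [t - 70 days, t]
# left    [0, 1, 2]  window [t - 70 days, t)
# neither [0, 0, 2]  window (t - 70 days, t)
```

Both 'left' and 'neither' leave out the current row. With 'left' an event exactly 70 days earlier is still counted, which, for midnight timestamps like those in the question, is also what the daily resample approach does; with 'neither' it is dropped. On the question's data no two events of the same id are exactly 70 days apart, so both give the same result there.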
