Pandas temporal cumulative sum by group

Question

I have a data frame where one or more events are recorded for each id. For each event, the id, a metric x and a date are recorded. Something like this:

import pandas as pd
import datetime as dt
import numpy as np
x = range(0, 6)
ids = ['a', 'a', 'b', 'a', 'b', 'b']
dates = [dt.datetime(2012, 5, 2), dt.datetime(2012, 4, 2), dt.datetime(2012, 6, 2),
         dt.datetime(2012, 7, 30), dt.datetime(2012, 4, 1), dt.datetime(2012, 5, 9)]

# Build from a dict so that x stays numeric and dates gets a datetime dtype
# (np.column_stack would turn every column into generic objects)
df = pd.DataFrame({'id': ids, 'x': list(x), 'dates': dates})

I'd like to be able to set a lookback period (e.g. 70 days) and calculate, for each row in the dataset, a cumulative sum of x over any preceding events for that id that fall within the lookback period (excluding x for the row the calculation is being performed for). The result should look like this:

  id  x                dates    want
0  a  0  2012-05-02 00:00:00    1
1  a  1  2012-04-02 00:00:00    0
2  b  2  2012-06-02 00:00:00    9
3  a  3  2012-07-30 00:00:00    0
4  b  4  2012-04-01 00:00:00    0
5  b  5  2012-05-09 00:00:00    4
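
Read literally, the requested quantity can be spelled out with a plain row-by-row filter. The following is only a brute-force sketch for checking results on small data, not part of the original post. The names lookback and naive_want are made up for illustration, and since the question does not say whether an event exactly 70 days earlier should count, this sketch assumes it does.

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'a', 'b', 'b'],
    'x': [0, 1, 2, 3, 4, 5],
    'dates': [dt.datetime(2012, 5, 2), dt.datetime(2012, 4, 2), dt.datetime(2012, 6, 2),
              dt.datetime(2012, 7, 30), dt.datetime(2012, 4, 1), dt.datetime(2012, 5, 9)],
})

lookback = pd.Timedelta(days=70)  # hypothetical parameter name

def naive_want(row):
    # Earlier events for the same id that fall inside the lookback window.
    # Assumption: an event exactly `lookback` earlier is still included.
    mask = ((df['id'] == row['id'])
            & (df['dates'] < row['dates'])
            & (df['dates'] >= row['dates'] - lookback))
    return df.loc[mask, 'x'].sum()

# One full scan of the frame per row: quadratic, so only suitable for small data
df['want'] = df.apply(naive_want, axis=1)
print(df)
```

This scans the whole frame once per row, so it is mainly useful as a reference to compare faster approaches against.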

Answer 1

Well, one approach is the following: (1) do a groupby/apply with 'id' as the grouping variable. (2) Within the apply, resample the group to a daily time series. (3) Then use a rolling sum (and shift, so you don't include the current row's 'x' value) to compute the sum over your 70-day lookback period. (4) Reduce the group back to only the original observations:

In [12]: df = df.sort_values(['id','dates'])
In [13]: df
Out[13]: 
  id  x      dates
1  a  1 2012-04-02
0  a  0 2012-05-02
3  a  3 2012-07-30
4  b  4 2012-04-01
5  b  5 2012-05-09
2  b  2 2012-06-02

You are going to need your data sorted by ['id','dates']. Now we can do the groupby/apply:

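In [15]: def past70(g):
             # Daily series per group; days without an event become NaN
             g = g.set_index('dates').resample('D').last()
             # 70-day rolling sum, shifted by one day to leave out the current row's x
             g['want'] = g['x'].rolling(70, min_periods=0).sum().shift(1)
             # Keep only the days that carried an original observation
             return g[g.x.notnull()]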

In [16]: df = df.groupby('id').apply(past70).drop('id',axis=1)
In [17]: df
Out[17]: 
               x  want
id dates              
a  2012-04-02  1   NaN
   2012-05-02  0     1
   2012-07-30  3     0
b  2012-04-01  4   NaN
   2012-05-09  5     4
   2012-06-02  2     9

If you don't want the NaNs then just do:

In [28]: df.fillna(0)
Out[28]: 
               x  want
id dates              
a  2012-04-02  1     0
   2012-05-02  0     1
   2012-07-30  3     0
b  2012-04-01  4     0
   2012-05-09  5     4
   2012-06-02  2     9
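
To see what steps (2) to (4) do to a single group, here is a small self-contained sketch for group 'a' only. It is written against the current pandas spelling (.resample('D').last() and .rolling(...).sum()), which is an assumption about the reader's pandas version; the variable names daily, want and result are illustrative.

```python
import pandas as pd

# Group 'a' from the question, already sorted by date
g = pd.DataFrame({
    'dates': pd.to_datetime(['2012-04-02', '2012-05-02', '2012-07-30']),
    'x': [1, 0, 3],
})

# Step 2: one row per calendar day; days without an event hold NaN
daily = g.set_index('dates')['x'].resample('D').last()

# Step 3: sum of the non-missing values in each 70-day window, then shift by
# one day so that a row never counts its own x
want = daily.rolling(70, min_periods=0).sum().shift(1)

# Step 4: keep only the days that had an original observation
result = pd.DataFrame({'x': daily, 'want': want})[daily.notnull()]
print(result)
```

The first row of the group has nothing before it, so the shift leaves it as NaN; that is the value fillna(0) replaces.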

Edit: If you want to make the lookback window a parameter, do something like the following:

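def past_window(g, win=70):
    # Daily series per group; days without an event become NaN
    g = g.set_index('dates').resample('D').last()
    # Rolling sum over `win` days, shifted to leave out the current row's x
    g['want'] = g['x'].rolling(win, min_periods=0).sum().shift(1)
    return g[g.x.notnull()]

# Extra keyword arguments to apply are passed on to past_window
df = df.groupby('id').apply(past_window, win=10)
print(df.fillna(0))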

Answer 2

I needed to do something similar, so I looked around and found this page in the pandas cookbook (which I warmly recommend to anyone who wants to learn about all the great possibilities of this package): Pandas: rolling mean by time interval. With recent versions of pandas, you can pass rolling() an additional argument naming a datetime-like column, and the window is then calculated from that column. So the example becomes more straightforward:

# First, make sure dates has a datetime dtype and x is numeric
df['dates'] = pd.to_datetime(df['dates'])
df['x'] = pd.to_numeric(df['x'])

# Then, sort so that the dates are monotonic within each id
df.sort_values(['id', 'dates'], inplace=True)

# '70d' is the time window we are considering
# closed='neither' excludes both window bounds, so the current row's x is not counted
df['want'] = df.groupby('id').rolling('70d', on='dates',
                                      closed='neither')['x'].sum().to_numpy()

# Rows with no earlier event inside the window come back as NaN; report them as 0
df['want'] = df['want'].fillna(0).astype(int)
df.sort_index() # to display it in the same order as the example provided
  id  x      dates  want
0  a  0 2012-05-02     1
1  a  1 2012-04-02     0
2  b  2 2012-06-02     9
3  a  3 2012-07-30     0
4  b  4 2012-04-01     0
5  b  5 2012-05-09     4
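
The closed argument is what keeps the current row out of its own sum. As a small illustration (not from the original answer), the sketch below compares the default for offset windows, closed='right', with closed='neither' on the dates of group 'a'; the names s, incl and excl are illustrative.

```python
import pandas as pd

s = pd.Series([1, 0, 3],
              index=pd.to_datetime(['2012-04-02', '2012-05-02', '2012-07-30']))

# closed='right' (default for offset windows): (t - 70d, t], each row counts itself
incl = s.rolling('70d').sum()

# closed='neither': (t - 70d, t), the row at t is left out of its own window
excl = s.rolling('70d', closed='neither').sum()

print(pd.DataFrame({'x': s, 'right': incl, 'neither': excl}))
```

With closed='neither', a row that has no earlier event inside the window yields NaN rather than 0, which is why the answer fills the missing values afterwards.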
