Pandas groupby multiple columns with rolling date offset - How?

I am trying to compute a rolling sum across partitioned data, based on a moving window of 2 business days. It feels like it should be both easy and widely used, but the solution is beyond me.

#generate sample data
import pandas as pd
import numpy as np
import datetime
vals = [-4,17,-4,-16,2,20,3,10,-17,-8,-21,2,0,-11,16,-24,-10,-21,5,12,14,9,-15,-15]
grp = ['X']*6 + ['Y'] * 6 + ['X']*6 + ['Y'] * 6
typ = ['foo']*12+['bar']*12
dat = ['19/01/18','19/01/18','22/01/18','22/01/18','23/01/18','24/01/18'] * 4
#create dataframe with sample data
df = pd.DataFrame({'group': grp,'type':typ,'value':vals,'date':dat})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y')  # dates are day-first
df.head(12)

This gives the following (only the first 12 rows are shown):

    date    group   type    value
0   19/01/2018  X   foo     -4
1   19/01/2018  X   foo     17
2   22/01/2018  X   foo     -4
3   22/01/2018  X   foo     -16
4   23/01/2018  X   foo     2
5   24/01/2018  X   foo     20
6   19/01/2018  Y   foo     3
7   19/01/2018  Y   foo     10
8   22/01/2018  Y   foo     -17
9   22/01/2018  Y   foo     -8
10  23/01/2018  Y   foo     -21
11  24/01/2018  Y   foo     2

The desired results are (all rows shown here):

    date    group   type    2BD Sum
1   19/01/2018  X   foo     13
2   22/01/2018  X   foo     -7
3   23/01/2018  X   foo     -18
4   24/01/2018  X   foo     22
5   19/01/2018  Y   foo     13
6   22/01/2018  Y   foo     -12
7   23/01/2018  Y   foo     -46
8   24/01/2018  Y   foo     -19
9   19/01/2018  X   bar     -11
10  22/01/2018  X   bar     -19
11  23/01/2018  X   bar     -18
12  24/01/2018  X   bar     -31
13  19/01/2018  Y   bar     17
14  22/01/2018  Y   bar     40
15  23/01/2018  Y   bar     8
16  24/01/2018  Y   bar     -30

I have looked at this question and tried:

df.groupby(['group','type']).rolling('2d',on='date').agg({'value':'sum'}
).reset_index().groupby(['group','type','date']).agg({'value':'sum'}).reset_index()

This would work fine if 'value' were always positive, but that is not the case here. I have tried many other approaches that raised errors, which I can list if that is useful. Can anyone help?
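One thing worth keeping in mind with the attempt above: a '2d' offset window counts calendar days, not business days, so a Monday row does not reach back to the preceding Friday. The sketch below illustrates this on a made-up three-day series (the values 1, 10, 100 are purely illustrative and not part of the sample data) and contrasts it with a window of 2 rows.

```python
import pandas as pd

# Illustrative series: Friday, Monday, Tuesday (values are made up).
s = pd.Series(
    [1, 10, 100],
    index=pd.to_datetime(["2018-01-19", "2018-01-22", "2018-01-23"]),
)

# Offset window: covers the 2 calendar days ending at each timestamp,
# so Monday's window does not include the previous Friday.
calendar = s.rolling("2D").sum()

# Row-count window: current row plus the previous row, whatever the gap.
by_rows = s.rolling(2, min_periods=1).sum()

print(calendar.tolist())
print(by_rows.tolist())
```

With one row per business day, the row-count window behaves like a 2-business-day window; the offset window does not.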

IIUC, starting from your code:

import pandas as pd
import numpy as np
import datetime
vals = [-4,17,-4,-16,2,20,3,10,-17,-8,-21,2,0,-11,16,-24,-10,-21,5,12,14,9,-15,-15]
grp = ['X']*6 + ['Y'] * 6 + ['X']*6 + ['Y'] * 6
typ = ['foo']*12+['bar']*12
dat = ['19/01/18','19/01/18','22/01/18','22/01/18','23/01/18','24/01/18'] * 4
df = pd.DataFrame({'group': grp,'type':typ,'value':vals,'date':dat})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y')  # dates are day-first

We start by grouping by group, type and date, and summing within each day:

df2 = df.groupby(["group", "type", "date"]).sum().reset_index().sort_values("date")

Now you can perform a rolling sum() over a window of 2 rows within each (group, type) pair, with min_periods=1 so that the first value in each pair is not NaN.

k = df2.groupby(["group", "type"]).value.rolling(window=2, min_periods=1).sum()

This yields

group  type    
X      bar   0    -11.0
             1    -19.0
             2    -18.0
             3    -31.0
       foo   4     13.0
             5     -7.0
             6    -18.0
             7     22.0
Y      bar   8     17.0
             9     40.0
             10     8.0
             11   -30.0
       foo   12    13.0
             13   -12.0
             14   -46.0
             15   -19.0

which is already what you want, but without your date values. To get the dates, we can use a trick: replace the third level of this MultiIndex with the date values taken from a similar df grouped the same way. Hence, we can do

aux = df2.groupby(["group", "type", "date"]).date.rolling(2).count().index.get_level_values(2)

and substitute the index:

k.index = pd.MultiIndex.from_tuples([(k.index[x][0], k.index[x][1], aux[x]) for x in range(len(k.index))])

Finally, you have your expected output:

k.to_frame()

    group   type    date        value
0   X       bar     2018-01-19  -11.0
1   X       bar     2018-01-22  -19.0
2   X       bar     2018-01-23  -18.0
3   X       bar     2018-01-24  -31.0
4   X       foo     2018-01-19  13.0
5   X       foo     2018-01-22  -7.0
6   X       foo     2018-01-23  -18.0
7   X       foo     2018-01-24  22.0
8   Y       bar     2018-01-19  17.0
9   Y       bar     2018-01-22  40.0
10  Y       bar     2018-01-23  8.0
11  Y       bar     2018-01-24  -30.0
12  Y       foo     2018-01-19  13.0
13  Y       foo     2018-01-22  -12.0
14  Y       foo     2018-01-23  -46.0
15  Y       foo     2018-01-24  -19.0
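As a possible shortcut, the same numbers can be obtained without substituting the index afterwards. The sketch below is one way to do it, not the method used above: it keeps date as a regular column and uses groupby(...).transform so the rolling result lines up with the daily rows. The names daily and "2BD Sum" are my own choices, and the window is still 2 rows per (group, type), so it assumes every pair has one row for each business day.

```python
import pandas as pd

vals = [-4, 17, -4, -16, 2, 20, 3, 10, -17, -8, -21, 2,
        0, -11, 16, -24, -10, -21, 5, 12, 14, 9, -15, -15]
dat = ["19/01/18", "19/01/18", "22/01/18", "22/01/18", "23/01/18", "24/01/18"] * 4

df = pd.DataFrame({
    "group": ["X"] * 6 + ["Y"] * 6 + ["X"] * 6 + ["Y"] * 6,
    "type": ["foo"] * 12 + ["bar"] * 12,
    "value": vals,
    "date": pd.to_datetime(dat, format="%d/%m/%y"),
})

# Step 1: one row per (group, type, date) holding that day's total.
daily = df.groupby(["group", "type", "date"], as_index=False)["value"].sum()

# Step 2: 2-row rolling sum inside each (group, type); transform keeps
# the result aligned with the rows of `daily`, so the dates are retained.
daily["2BD Sum"] = daily.groupby(["group", "type"])["value"].transform(
    lambda s: s.rolling(2, min_periods=1).sum()
)

print(daily[["group", "type", "date", "2BD Sum"]])
```

Because the groupby sorts its keys, the rows come out in the order X/bar, X/foo, Y/bar, Y/foo, with dates ascending inside each pair.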

I expected the following to work:

g = lambda ts: ts.rolling('2B', on='date')['value'].sum()
df.groupby(['group', 'type']).apply(g)

However, I get an error, because a business day is not a fixed frequency.
This leads me to suggest the following solution, which is a lot uglier:

value_per_bday = lambda df: df.resample('B', on='date')['value'].sum()
df = df.groupby(['group', 'type']).apply(value_per_bday).stack()
value_2_bdays = lambda x: x.rolling(2, min_periods=1).sum()
df = df.groupby(axis=0, level=['group', 'type']).apply(value_2_bdays)

It may read better as a function; your pick.

def resample_and_sum(x):
    x = x.resample('B', on='date')['value'].sum()
    x = x.rolling(2, min_periods=1).sum()
    return x

df = df.groupby(['group', 'type']).apply(resample_and_sum).stack()
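The point of resampling to 'B' first is that a business day with no rows becomes an explicit 0, so a window of 2 rows really spans 2 business days. Here is a minimal sketch of that idea written as an explicit loop over the groups, which sidesteps differences in how groupby.apply shapes its result. The function name two_bday_sum and the small sample frame are hypothetical, made up for illustration.

```python
import pandas as pd

def two_bday_sum(frame: pd.DataFrame) -> pd.DataFrame:
    """Rolling 2-business-day sum of 'value' per (group, type)."""
    parts = []
    for (group, typ), sub in frame.groupby(["group", "type"]):
        # Daily totals on a business-day grid; missing days sum to 0.
        daily = sub.set_index("date")["value"].resample("B").sum()
        # Two consecutive rows are now two consecutive business days.
        rolled = daily.rolling(2, min_periods=1).sum()
        parts.append(pd.DataFrame({
            "group": group,
            "type": typ,
            "date": rolled.index,
            "2BD Sum": rolled.to_numpy(),
        }))
    return pd.concat(parts, ignore_index=True)

# Made-up data with a gap: Friday and Tuesday present, Monday missing.
sample = pd.DataFrame({
    "group": ["A", "A"],
    "type": ["foo", "foo"],
    "value": [5, 7],
    "date": pd.to_datetime(["2018-01-19", "2018-01-23"]),
})

result = two_bday_sum(sample)
print(result)
```

In this sample the Tuesday sum covers only Monday (0) and Tuesday (7); a plain 2-row window without resampling would have added Friday's 5 instead.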
