Pandas rolling window with a filtering condition to remove the latest data
This is a follow-up to an earlier question. I would like to compute a rolling window over the last n days, but filter out the latest x days from each window (x is smaller than n).
Here is an example:
import pandas as pd

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')  # dates are day-first
df = df.sort_values('Date')
   Name       Date  Earning
0  Jack 2021-01-08       40
1   Jim 2021-01-27       10
2  Jack 2021-02-05       20
3   Jim 2021-02-10       20
4  Jack 2021-02-17       40
5  Jack 2021-02-18       50
6   Jim 2021-02-20      100
7  Jack 2021-02-21       70
8  Jane 2021-02-22       80
9  Jane 2021-03-29       90
I would like to:
1. For each row, take the last 30 days of the same Name - call it a window
2. Remove the latest 20 days of each window (i.e. only take the earliest 10 days)
3. Calculate the sum on the Earning column
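The window described in the steps above can be sketched as plain date arithmetic. This is only an illustration: the names N_DAYS, X_DAYS and bounds are made up here, and the three dates are taken from the sample data.

```python
import pandas as pd

# Hypothetical parameter names; the values 30 and 20 come from the question.
N_DAYS = 30  # full look-back window (n)
X_DAYS = 20  # most recent days to discard (x)

dates = pd.Series(pd.to_datetime(['2021-01-08', '2021-02-05', '2021-02-20']))

# The window starts n days before the row's date ...
window_from = dates - pd.Timedelta(days=N_DAYS)
# ... and ends after the earliest (n - x) days, here 10 days.
window_to = window_from + pd.Timedelta(days=N_DAYS - X_DAYS)

bounds = pd.DataFrame({'Date': dates,
                       'Window_From': window_from,
                       'Window_To': window_to})
print(bounds)
```

For the row dated 2021-02-05 this gives the window 2021-01-06 to 2021-01-16.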
Expected outcome (the two columns Window_From and Window_To are not needed; I only use them to demonstrate the mock data):
   Name       Date  Earning Window_From  Window_To   Sum
0  Jack 2021-01-08       40  2020-12-09 2020-12-19   0.0
1   Jim 2021-01-27       10  2020-12-28 2021-01-07   0.0
2  Jack 2021-02-05       20  2021-01-06 2021-01-16  40.0
3   Jim 2021-02-10       20  2021-01-11 2021-01-21   0.0
4  Jack 2021-02-17       40  2021-01-18 2021-01-28   0.0
5  Jack 2021-02-18       50  2021-01-19 2021-01-29   0.0
6   Jim 2021-02-20      100  2021-01-21 2021-01-31  10.0
7  Jack 2021-02-21       70  2021-01-22 2021-02-01   0.0
8  Jane 2021-02-22       80  2021-01-23 2021-02-02   0.0
9  Jane 2021-03-29       90  2021-02-27 2021-03-09   0.0
Calculate the 30-day and 20-day rolling sums, then subtract the 20-day sum from the 30-day sum to get the effective rolling sum for the earliest 10 days of each window:
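# 30-day rolling sum per Name: covers (Date - 30 days, Date]
s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
# 20-day rolling sum per Name: covers (Date - 20 days, Date]
s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()
# The difference keeps only the earliest 10 days of each 30-day window
df.merge(s1.sub(s2).reset_index(name='sum'), how='left')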
   Name       Date  Earning   sum
0  Jack 2021-01-08       40   0.0
1   Jim 2021-01-27       10   0.0
2  Jack 2021-02-05       20  40.0
3   Jim 2021-02-10       20   0.0
4  Jack 2021-02-17       40   0.0
5  Jack 2021-02-18       50   0.0
6   Jim 2021-02-20      100  10.0
7  Jack 2021-02-21       70   0.0
8  Jane 2021-02-22       80   0.0
9  Jane 2021-03-29       90   0.0
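To see why the subtraction works, here is a small sketch on a single name, using the Jack rows from the sample data. With pandas' default settings a time-based rolling window is closed on the right only, so the difference covers the half-open interval (Date - 30 days, Date - 20 days]. The helper brute_force is a hypothetical name used only for this check.

```python
import pandas as pd

# Earnings for one name, indexed by date (the 'Jack' rows of the sample data).
s = pd.Series(
    [40, 20, 40, 50, 70],
    index=pd.to_datetime(['2021-01-08', '2021-02-05', '2021-02-17',
                          '2021-02-18', '2021-02-21']),
    name='Earning',
)

sum_30 = s.rolling('30D').sum()  # covers (t - 30 days, t]
sum_20 = s.rolling('20D').sum()  # covers (t - 20 days, t]
oldest_10 = sum_30 - sum_20      # leaves (t - 30 days, t - 20 days]

def brute_force(t):
    """Sum the same interval directly with a boolean mask (illustrative helper)."""
    lower = t - pd.Timedelta(days=30)
    upper = t - pd.Timedelta(days=20)
    return s[(s.index > lower) & (s.index <= upper)].sum()

expected = pd.Series([brute_force(t) for t in s.index],
                     index=s.index, dtype='float64')

# Both approaches agree for every row.
pd.testing.assert_series_equal(oldest_10, expected, check_names=False)
print(oldest_10)
```

Note that the left edge is excluded here, whereas the Window_From column in the question reads as an inclusive bound; the two differ only when a row falls exactly on that boundary date.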
An alternative to rolling (which I expected might be faster):
EDIT: it is actually slower with the OP's dataset.
# Window bounds for each row: [Date - 30 days, Date - 20 days]
df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] + pd.Timedelta(days=10)
df = df.set_index(['Name', 'Date'])
# For each row, select the same Name and slice its dates (inclusive on both ends)
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum()
             for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]
print(df.reset_index().drop(columns=['start', 'end']))
   Name       Date  Earning  Sum
0  Jack 2021-01-08       40    0
1   Jim 2021-01-27       10    0
2  Jack 2021-02-05       20   40
3   Jim 2021-02-10       20    0
4  Jack 2021-02-17       40    0
5  Jack 2021-02-18       50    0
6   Jim 2021-02-20      100   10
7  Jack 2021-02-21       70    0
8  Jane 2021-02-22       80    0
9  Jane 2021-03-29       90    0
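The per-row slice above can also be written without a Python loop. The following is a rough sketch, not part of the original answer: it pairs every row with all rows of the same Name through a self-merge and keeps the pairs that fall inside the inclusive [start, end] range. The names pairs, in_window and sums are illustrative, and the data is a four-row subset of the sample.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jack', 'Jim', 'Jack', 'Jim'],
    'Date': pd.to_datetime(['2021-01-08', '2021-01-27',
                            '2021-02-05', '2021-02-20']),
    'Earning': [40, 10, 20, 100],
})

# Pair each row with every row of the same Name; 'index' remembers the
# original row, and the '_past' suffix marks the candidate window rows.
pairs = df.reset_index().merge(df, on='Name', suffixes=('', '_past'))

start = pairs['Date'] - pd.Timedelta(days=30)
end = start + pd.Timedelta(days=10)
in_window = pairs['Date_past'].between(start, end)  # inclusive on both ends

# Sum the earnings of the matching pairs per original row; rows with
# no match get 0.
sums = pairs.loc[in_window].groupby('index')['Earning_past'].sum()
df['Sum'] = sums.reindex(df.index, fill_value=0)
print(df)
```

The intermediate pairs frame grows with the square of the number of rows per name, so this trades memory for the loop and may not suit large groups.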