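Pandas rolling window with a filtering condition to remove the most recent data
This is a follow-up to this question. I want to apply a rolling window over the last n days, but filter out the most recent x days from each window (x is smaller than n).
Here is an example: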
import pandas as pd

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
# Dates are day-first strings; parse them and sort chronologically
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
   Name       Date  Earning
0  Jack 2021-01-08       40
1   Jim 2021-01-27       10
2  Jack 2021-02-05       20
3   Jim 2021-02-10       20
4  Jack 2021-02-17       40
5  Jack 2021-02-18       50
6   Jim 2021-02-20      100
7  Jack 2021-02-21       70
8  Jane 2021-02-22       80
9  Jane 2021-03-29       90
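For each Name, I want the sum of the Earning column over the last 30 days, excluding the most recent 20 days of that window (i.e. only the earliest 10 days of the window are used).
Expected result (the Window_From and Window_To columns are only there to illustrate the window; they are mock data):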
   Name       Date  Earning  Window_From   Window_To   Sum
0  Jack 2021-01-08       40   2020-12-09  2020-12-19   0.0
1   Jim 2021-01-27       10   2020-12-28  2021-01-07   0.0
2  Jack 2021-02-05       20   2021-01-06  2021-01-16  40.0
3   Jim 2021-02-10       20   2021-01-11  2021-01-21   0.0
4  Jack 2021-02-17       40   2021-01-18  2021-01-28   0.0
5  Jack 2021-02-18       50   2021-01-19  2021-01-29   0.0
6   Jim 2021-02-20      100   2021-01-21  2021-01-31  10.0
7  Jack 2021-02-21       70   2021-01-22  2021-02-01   0.0
8  Jane 2021-02-22       80   2021-01-23  2021-02-02   0.0
9  Jane 2021-03-29       90   2021-02-27  2021-03-09   0.0
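For reference, the expected table can be reproduced with a direct, unoptimized check that builds the window bounds explicitly and sums the matching rows. This is only a sketch: the names n_days, x_days and window_sum are made up for illustration, and it assumes both window edges are inclusive, which is how the mock Window_From and Window_To columns read.

```python
import pandas as pd

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021', '27/01/2021', '05/02/2021', '10/02/2021', '17/02/2021',
              '18/02/2021', '20/02/2021', '21/02/2021', '22/02/2021', '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df = df.sort_values('Date')

n_days, x_days = 30, 20  # window length, and number of most recent days to drop

expected = df.copy()
expected['Window_From'] = expected['Date'] - pd.Timedelta(days=n_days)
expected['Window_To'] = expected['Date'] - pd.Timedelta(days=x_days)

def window_sum(row):
    # Rows with the same Name whose Date lies in [Window_From, Window_To]
    mask = (
        (df['Name'] == row['Name'])
        & (df['Date'] >= row['Window_From'])
        & (df['Date'] <= row['Window_To'])
    )
    return df.loc[mask, 'Earning'].sum()

expected['Sum'] = expected.apply(window_sum, axis=1)
print(expected)
```

This scans the whole frame once per row, so it is only meant as a correctness baseline for small data, not as a solution.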
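Compute the 30-day and 20-day rolling sums per Name, then subtract the 20-day sum from the 30-day sum to get the effective rolling sum over the earliest 10 days of the window: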
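# Rolling sums of Earning per Name over the last 30 days and the last 20 days
s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()
# 30-day sum minus 20-day sum leaves only the earliest 10 days of the window
df.merge(s1.sub(s2).reset_index(name='sum'), how='left')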
   Name       Date  Earning   sum
0  Jack 2021-01-08       40   0.0
1   Jim 2021-01-27       10   0.0
2  Jack 2021-02-05       20  40.0
3   Jim 2021-02-10       20   0.0
4  Jack 2021-02-17       40   0.0
5  Jack 2021-02-18       50   0.0
6   Jim 2021-02-20      100  10.0
7  Jack 2021-02-21       70   0.0
8  Jane 2021-02-22       80   0.0
9  Jane 2021-03-29       90   0.0
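The same idea can be wrapped up for arbitrary n and x. The sketch below is one possible way to do it, not part of the answer itself: the function name rolling_sum_excluding_recent and its parameters are hypothetical, and it assumes each (Name, Date) pair appears at most once, since the merge back onto the frame joins on those two columns.

```python
import pandas as pd

def rolling_sum_excluding_recent(df, n_days, x_days,
                                 group='Name', on='Date', value='Earning'):
    """Per-group sum of `value` over the last n_days, minus the last x_days."""
    if not 0 < x_days < n_days:
        raise ValueError('expected 0 < x_days < n_days')
    df = df.sort_values(on)  # time-based rolling needs monotonic dates
    grouped = df.groupby(group)
    long_sum = grouped.rolling(f'{n_days}D', on=on)[value].sum()
    short_sum = grouped.rolling(f'{x_days}D', on=on)[value].sum()
    # The result is indexed by (group, on); turn it back into columns to merge
    diff = long_sum.sub(short_sum).reset_index(name='Sum')
    return df.merge(diff, how='left', on=[group, on])

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021', '27/01/2021', '05/02/2021', '10/02/2021', '17/02/2021',
              '18/02/2021', '20/02/2021', '21/02/2021', '22/02/2021', '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

result = rolling_sum_excluding_recent(df, n_days=30, x_days=20)
print(result)
```

Note that offset windows are closed on the right by default, so each rolling sum covers (Date - window, Date] and the difference covers (Date - 30 days, Date - 20 days]. A row dated exactly 30 days earlier is therefore not counted, while one dated exactly 20 days earlier is.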
An alternative to rolling (possibly faster):
Edit: it is actually slower on the OP's dataset.
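# Window bounds: start 30 days before each row's Date, end 10 days after the start
df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] + pd.Timedelta(days=10)
df = df.set_index(['Name', 'Date'])
# For each row, take that Name's rows and sum Earning over [start, end]
# (label slicing with .loc includes both endpoints)
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum()
             for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]
print(df.reset_index().drop(columns=['start', 'end']))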
   Name       Date  Earning  Sum
0  Jack 2021-01-08       40    0
1   Jim 2021-01-27       10    0
2  Jack 2021-02-05       20   40
3   Jim 2021-02-10       20    0
4  Jack 2021-02-17       40    0
5  Jack 2021-02-18       50    0
6   Jim 2021-02-20      100   10
7  Jack 2021-02-21       70    0
8  Jane 2021-02-22       80    0
9  Jane 2021-03-29       90    0
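One thing to watch when choosing between the two approaches is how the window edges are treated. Slicing with .loc[start:end] includes both endpoints, while an offset-based rolling window is closed on the right only by default, so a row dated exactly 30 days earlier is counted by the slice but not by the rolling difference. The sketch below uses a made-up two-row frame (demo is hypothetical data, not from the question) to show the difference, and one way to make the rolling version inclusive on both ends by passing closed='both' for the long window.

```python
import pandas as pd

# Hypothetical data: the first row is exactly 30 days before the second
demo = pd.DataFrame({
    'Name': ['A', 'A'],
    'Date': pd.to_datetime(['2021-01-01', '2021-01-31']),
    'Earning': [5, 7],
})

g = demo.groupby('Name')

# Default closed='right': each window is (Date - window, Date]
right_30 = g.rolling('30D', on='Date')['Earning'].sum()
right_20 = g.rolling('20D', on='Date')['Earning'].sum()
half_open = right_30.sub(right_20)   # covers (Date - 30d, Date - 20d]

# closed='both' on the long window: [Date - 30d, Date]
both_30 = g.rolling('30D', on='Date', closed='both')['Earning'].sum()
inclusive = both_30.sub(right_20)    # covers [Date - 30d, Date - 20d]

print(half_open.tolist())
print(inclusive.tolist())
```

For the second row, the half-open variant gives 0 and the inclusive variant gives 5, because only the latter counts the row that sits exactly on the 30-day boundary. On the question's sample data no row falls exactly on a boundary, which is why both answers print the same sums.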