
Pandas rolling window with a filtering condition to remove the most recent data

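This is a follow-up question to this. I want to compute a rolling window over the last n days, but exclude the most recent x days from each window (x is less than n).

Here is an example: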

import pandas as pd

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}

df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
print(df)
   Name       Date  Earning
0  Jack 2021-01-08       40
1   Jim 2021-01-27       10
2  Jack 2021-02-05       20
3   Jim 2021-02-10       20
4  Jack 2021-02-17       40
5  Jack 2021-02-18       50
6   Jim 2021-02-20      100
7  Jack 2021-02-21       70
8  Jane 2021-02-22       80
9  Jane 2021-03-29       90

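I want to:

  • For each row, take the last 30 days for that Name; call this the window
  • Drop the most recent 20 days from each window (i.e. keep only the earliest 10 days)
  • Compute the sum of the Earning column

Expected result (the two columns Window_From and Window_To are only there to illustrate the window bounds on the sample data):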

   Name       Date  Earning Window_From  Window_To   Sum
0  Jack 2021-01-08       40  2020-12-09 2020-12-19   0.0
1   Jim 2021-01-27       10  2020-12-28 2021-01-07   0.0
2  Jack 2021-02-05       20  2021-01-06 2021-01-16  40.0
3   Jim 2021-02-10       20  2021-01-11 2021-01-21   0.0
4  Jack 2021-02-17       40  2021-01-18 2021-01-28   0.0
5  Jack 2021-02-18       50  2021-01-19 2021-01-29   0.0
6   Jim 2021-02-20      100  2021-01-21 2021-01-31  10.0
7  Jack 2021-02-21       70  2021-01-22 2021-02-01   0.0
8  Jane 2021-02-22       80  2021-01-23 2021-02-02   0.0
9  Jane 2021-03-29       90  2021-02-27 2021-03-09   0.0
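
The Window_From and Window_To columns in the table above can be derived directly from Date. The following is a minimal sketch, assuming the window starts 30 days before each row's Date and ends 20 days before it; the names `n_days`, `x_days` and `windows` are illustrative and not part of the question.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "Name": ["Jack", "Jim", "Jack", "Jim", "Jack", "Jack", "Jim", "Jack", "Jane", "Jane"],
    "Date": pd.to_datetime(
        ["08/01/2021", "27/01/2021", "05/02/2021", "10/02/2021", "17/02/2021",
         "18/02/2021", "20/02/2021", "21/02/2021", "22/02/2021", "29/03/2021"],
        format="%d/%m/%Y",
    ),
    "Earning": [40, 10, 20, 20, 40, 50, 100, 70, 80, 90],
}).sort_values("Date")

# Assumed parameters: n = 30-day window, x = 20 most recent days dropped.
n_days, x_days = 30, 20

# Each row gets the bounds of the slice that should be summed.
windows = df.assign(
    Window_From=df["Date"] - pd.Timedelta(days=n_days),
    Window_To=df["Date"] - pd.Timedelta(days=x_days),
)
print(windows)
```

Every window produced this way is exactly n - x = 10 days wide, which matches the Window_From and Window_To values shown in the expected result.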

Simple solution

Compute the 30-day and the 20-day rolling sum, then subtract the 20-day sum from the 30-day sum to get the effective rolling sum over the earliest 10 days.

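# Per-Name rolling sums over the last 30 and the last 20 days
s1 = df.groupby('Name').rolling('30D', on='Date')['Earning'].sum()
s2 = df.groupby('Name').rolling('20D', on='Date')['Earning'].sum()

# 30-day sum minus 20-day sum leaves only the earliest 10 days of each window;
# the result is merged back onto df by Name and Date
df.merge(s1.sub(s2).reset_index(name='sum'), how='left')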

   Name       Date  Earning   sum
0  Jack 2021-01-08       40   0.0
1   Jim 2021-01-27       10   0.0
2  Jack 2021-02-05       20  40.0
3   Jim 2021-02-10       20   0.0
4  Jack 2021-02-17       40   0.0
5  Jack 2021-02-18       50   0.0
6   Jim 2021-02-20      100  10.0
7  Jack 2021-02-21       70   0.0
8  Jane 2021-02-22       80   0.0
9  Jane 2021-03-29       90   0.0
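
The subtraction works because a time-based rolling window is closed on the right by default, so the 30-day sum covers (t - 30d, t] and the 20-day sum covers (t - 20d, t]; their difference covers (t - 30d, t - 20d]. The sketch below wraps this in a helper and cross-checks it against an explicit boolean mask. The function names `rolling_sum_excluding_recent` and `brute_force_sum` are hypothetical, and the sketch assumes the frame is sorted by Date and that each (Name, Date) pair is unique, since the merge joins on those two columns.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "Name": ["Jack", "Jim", "Jack", "Jim", "Jack", "Jack", "Jim", "Jack", "Jane", "Jane"],
    "Date": pd.to_datetime(
        ["08/01/2021", "27/01/2021", "05/02/2021", "10/02/2021", "17/02/2021",
         "18/02/2021", "20/02/2021", "21/02/2021", "22/02/2021", "29/03/2021"],
        format="%d/%m/%Y",
    ),
    "Earning": [40, 10, 20, 20, 40, 50, 100, 70, 80, 90],
}).sort_values("Date")


def rolling_sum_excluding_recent(frame, n_days=30, x_days=20):
    """Hypothetical helper: per-Name sum of Earning over (t - n_days, t - x_days]."""
    grouped = frame.groupby("Name")
    outer = grouped.rolling(f"{n_days}D", on="Date")["Earning"].sum()
    inner = grouped.rolling(f"{x_days}D", on="Date")["Earning"].sum()
    # Both series are indexed by (Name, Date), so they align row by row.
    return frame.merge((outer - inner).reset_index(name="Sum"), how="left")


def brute_force_sum(frame, name, t, n_days=30, x_days=20):
    """Reference implementation: explicit mask over the same half-open interval."""
    lo = t - pd.Timedelta(days=n_days)
    hi = t - pd.Timedelta(days=x_days)
    mask = (frame["Name"] == name) & (frame["Date"] > lo) & (frame["Date"] <= hi)
    return float(frame.loc[mask, "Earning"].sum())


result = rolling_sum_excluding_recent(df)
brute = [brute_force_sum(df, n, t) for n, t in zip(df["Name"], df["Date"])]

print(result)
print(brute)
```

On the sample data both approaches give 40 for Jack on 2021-02-05, 10 for Jim on 2021-02-20, and 0 everywhere else.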

An alternative to rolling (possibly faster):

Edit: it is actually slower on the OP's dataset.

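# Window bounds per row: the window starts 30 days before Date and spans 10 days
df['start'] = df['Date'] - pd.Timedelta(days=30)
df['end'] = df['start'] + pd.Timedelta(days=10)
# Index by (Name, Date) so each Name's rows can be sliced by date
df = df.set_index(['Name', 'Date'])
# For every row, select that Name's rows and sum Earning over start..end
# (label slicing with .loc includes both endpoints)
df['Sum'] = [df.xs(n, level=0).loc[start:end, 'Earning'].sum()
             for n, start, end in zip(df.index.get_level_values(0), df['start'], df['end'])]

print(df.reset_index().drop(columns=['start', 'end']))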
   Name       Date  Earning  Sum
0  Jack 2021-01-08       40    0
1   Jim 2021-01-27       10    0
2  Jack 2021-02-05       20   40
3   Jim 2021-02-10       20    0
4  Jack 2021-02-17       40    0
5  Jack 2021-02-18       50    0
6   Jim 2021-02-20      100   10
7  Jack 2021-02-21       70    0
8  Jane 2021-02-22       80    0
9  Jane 2021-03-29       90    0
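
The two answers agree on the sample data, but they may differ at the window edges: the rolling difference covers the half-open interval (t - 30d, t - 20d], while label slicing with .loc includes both endpoints. The sketch below illustrates this with a made-up two-row frame (`toy`, not part of the question) whose rows are exactly 30 days apart.

```python
import pandas as pd

# Hypothetical data: two rows for one name, exactly 30 days apart.
toy = pd.DataFrame({
    "Name": ["A", "A"],
    "Date": pd.to_datetime(["2021-01-01", "2021-01-31"]),
    "Earning": [5, 7],
})

# Approach 1: difference of rolling sums (windows are closed on the right only).
s30 = toy.groupby("Name").rolling("30D", on="Date")["Earning"].sum()
s20 = toy.groupby("Name").rolling("20D", on="Date")["Earning"].sum()
rolling_diff = (s30 - s20).tolist()

# Approach 2: explicit start/end with .loc label slicing (both ends included).
by_date = toy.set_index("Date")["Earning"]
loc_sums = []
for t in toy["Date"]:
    start = t - pd.Timedelta(days=30)
    end = start + pd.Timedelta(days=10)
    loc_sums.append(int(by_date.loc[start:end].sum()))

print(rolling_diff)  # the row dated exactly 30 days earlier is excluded
print(loc_sums)      # the same row is included by the inclusive slice
```

If rows can fall exactly 30 days before another row of the same Name, it is worth deciding which boundary behaviour is wanted before choosing between the two approaches.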
