简体   繁体   English

Python Pandas:使用滚动回溯窗口计算累计总和

[英]Python Pandas: Calculate cumulative sum using rolling lookback window

I am calculating the value for the Total Wins (2 Days) column in the (table below) - also see below for comma separated values. 我正在计算(下表)中“ 总获胜次数(2天)”列的值-逗号分隔值也请参见下文。

在此处输入图片说明

Total Wins (2 Days) is a cumulative count of the number of races the athlete has won either on that given day (eg. Day 5) or the day before (eg. Day 4) - as such, it is a count of the number of wins within a 2 day look back window. 总胜利数(2天)是该运动员在该天(例如第5天)或前一天(例如第4天)赢得的比赛次数的累计计数-因此,它是2天后窗内的获胜次数。 (I may want to change the look back window to any number). (我可能想将回顾窗口更改为任何数字)。

For example, on Day 7: Jane gets a count of 1 because she won on Day 7 but lost on Day 6; 例如,在第7天:Jane赢得了1分,因为她在第7天赢了,但在第6天输了; Bill gets a count of 1 because he lost on Day 7 but won on Day 6. Steve didn't win on either day. 比尔得到1的计数,因为他在第7天输了,但在第6天赢了。史蒂夫在任何一天都没有赢。

On Day 6, Bill gets a count of 1 because he won on day Day but lost on Day 5. Steve gets a count of 1 because he lost on Day 6 but won on Day 5. Jane won on neither day. 在第6天,比尔得到1的计数,因为他在第2天赢得了比赛,但在第5天输了。史蒂夫得到了1的计数,因为他在第6天失去了,但是在第5天赢得了。

What is the best way to calculate 'Total Wins (2 Days)' in Python? 用Python计算“总胜利(2天)”的最佳方法是什么?

Follow-up Question 后续问题

Also, as a follow-up question: With regards to the param for '.rolling(2)' (ie. 2 in this case), how would I set the param to be a value that is derived from other columns in the table? 另外,作为后续问题:关于“ .rolling(2)”的参数(即本例中为2),我如何将参数设置为从表中其他列派生的值?

What I want to do is change Race Day to a datetime object (see updated table below) and calculate 'Total Wins (X Days)' as the number of races won over the last week, month, year and so on. 我要做的是将“比赛日”更改为日期时间对象(请参见下表更新),并根据上周,月份,年份等赢得的比赛数来计算“总胜利(X天)”。 (The database I am working with is bigger than the example above). (我正在使用的数据库大于上面的示例)。

I would prefer to define Year as a calendar year (ie. races won between 2014-01-01 and 2015-01-01), rather than simply 265 days. 我宁愿将Year定义为日历年(即在2014年1月1日至2015年1月1日之间赢得的比赛),而不是简单的265天。

在此处输入图片说明

Race Day,Athlete,Position,Total Wins,Total Wins (2 Days)
1,Steve,1,1,1
1,Jane,2,0,0
1,Bill,3,0,0
2,Bill,1,1,1
2,Steve,2,1,1
2,Jane,3,0,0
3,Jane,1,1,1
3,Bill,2,1,1
3,Steve,3,1,0
4,Steve,1,2,1
4,Jane,2,1,1
4,Bill,3,1,0
5,Steve,1,3,2
5,Jane,2,1,0
5,Bill,3,1,0
6,Bill,1,2,1
6,Steve,2,3,1
6,Jane,3,1,0
7,Jane,1,2,1
7,Bill,2,2,1
7,Steve,3,3,0

Race Day,Athlete,Position,Total Wins,Total Wins (2 Months)
2000-01-01,Steve,1,1,?
2000-01-01,Jane,2,0,?
2000-01-01,Bill,3,0,?
2000-01-02,Bill,1,1,?
2000-01-02,Steve,2,1,?
2000-01-02,Jane,3,0,?
2000-01-03,Jane,1,1,?
2000-01-03,Bill,2,1,?
2000-01-03,Steve,3,1,?
2000-01-04,Steve,1,2,?
2000-01-04,Jane,2,1,?
2000-01-04,Bill,3,1,?
2000-01-05,Steve,1,3,?
2000-01-05,Jane,2,1,?
2000-01-05,Bill,3,1,?
2000-01-06,Bill,1,2,?
2000-01-06,Steve,2,3,?
2000-01-06,Jane,3,1,?
2000-01-07,Jane,1,2,?
2000-01-07,Bill,2,2,?
2000-01-07,Steve,3,3,?

Create a Won column that captures Position 1 for each row and then apply rolling sum 创建一个Won列以捕获每一行的位置1,然后应用滚动总和

df['Won'] = (df['Position'] == 1).astype(int)

df['Total Wins (2 Days)'] = df.groupby('Athlete')['Won'].apply(lambda x: x.rolling(2).sum()).combine_first(df['Won'])

    Race Day    Athlete Position    Total Wins  Total Wins (2 Days) Won
0   1           Steve   1           1           1.0                 1
1   1           Jane    2           0           0.0                 0
2   1           Bill    3           0           0.0                 0
3   2           Bill    1           1           1.0                 1
4   2           Steve   2           1           1.0                 0
5   2           Jane    3           0           0.0                 0
6   3           Jane    1           1           1.0                 1
7   3           Bill    2           1           1.0                 0
8   3           Steve   3           1           0.0                 0
9   4           Steve   1           2           1.0                 1
10  4           Jane    2           1           1.0                 0
11  4           Bill    3           1           0.0                 0
12  5           Steve   1           3           2.0                 1
13  5           Jane    2           1           0.0                 0
14  5           Bill    3           1           0.0                 0
15  6           Bill    1           2           1.0                 1
16  6           Steve   2           3           1.0                 0
17  6           Jane    3           1           0.0                 0
18  7           Jane    1           2           1.0                 1
19  7           Bill    2           2           1.0                 0
20  7           Steve   3           3           0.0                 0

You can delete the column Won using 您可以使用

df.drop('Won', axis = 1, inplace = True)

Option 2: 选项2:

df['Won'] = (df['Position'] == 1).astype(int)

df['Total Wins (2 Days)'] = df.groupby('Athlete')['Won'].apply(lambda x: x.add(x.shift())).combine_first(df['Won'])

Edit: For the follow up question with Race Day being date, you can add a column by aggregating the data on day, week, month etc (based on your requirement) and then find the sum of current and previous period. 编辑:对于以“比赛日”为日期的后续问题,您可以通过汇总日,周,月等(根据您的要求)中的数据来添加一列,然后找到当前和上一期间的总和。

df['Race Day'] = pd.to_datetime(df['Race Day'])

df['Won'] = (df['Position'] == 1).astype(int)


df['Total Wins Day']=df.groupby(['Athlete', df['Race Day'].dt.day])['Won'].transform('sum')
df['Total Wins week']=df.groupby(['Athlete', df['Race Day'].dt.week])['Won'].transform('sum')
df['Total Wins Month']=df.groupby(['Athlete', df['Race Day'].dt.month])['Won'].transform('sum')

Now if you want total wins in last two months, you would use the Total Wins Month column and sum that with the previous column 现在,如果您希望最近两个月的总胜利数,可以使用“总胜利数月份”列,并将其与上一列相加

df['Total Wins (2 Months)'] = df.groupby('Athlete')['Total Wins Month'].apply(lambda x: x.add(x.shift())).combine_first(df['Won'])

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM