I have a dataset where each record contains match-level data, for example:
MATCH_DATE | PLAYER1 | PLAYER2 | TOURNAMENT | SURFACE | PLAYER1_SERVE% | PLAYER2_SERVE%
09DEC2020 | Mike | Jim | Rome Open | Clay | 65% | 70%
I'm trying to create new columns based on rolling time windows per player and surface, e.g., LAST90DAYS_PLAYER1_CLAYSERVE% and LAST5MATCHES_PLAYER1_CLAYSERVE%. Note that both fields should be computed for the same SURFACE as the record in question.
I then need to append those new columns to the original dataset to arrive at a final dataset like:
MATCH_DATE | PLAYER1 | PLAYER2 | TOURNAMENT | SURFACE | PLAYER1_SERVE% | PLAYER2_SERVE% | LAST90DAYS_PLAYER1_CLAYSERVE% | LAST5MATCHES_PLAYER1_CLAYSERVE%
09DEC2020 | Mike | Jim | Rome Open | Clay | 65% | 70% | 62.5% | 69.2%
Is there an elegant Pandas command that can compute this type of time-window based stats/features for each row of data? Or do I need to code a Python function from scratch with proper loops plus if/then-else logic?
I have more experience with SQL so my inclination is to issue multiple "group by" queries to compute each new column separately and join a bunch of tables, in the end, to arrive at the final table/dataset. So a multi-step process instead of an elegant single line of Pandas code with a built-in loop.
Thanks in advance!
You can use the pandas.DataFrame.rolling method; see the pandas documentation for details.
As an example, suppose you are working with an Apple stock price time series. The code below computes the mean over a rolling window of 5 rows (here, 5 consecutive trading days, since rolling(5) counts rows rather than calendar days). You can compute other metrics, such as the sum or the standard deviation, in the same way:
>>> aapl = data[['AAPL']].copy()
>>> aapl
AAPL
Date
2010-01-04 30.572857
2010-01-05 30.625713
2010-01-06 30.138571
2010-01-07 30.082857
2010-01-08 30.282858
...
2018-12-25 152.000000
2018-12-26 157.169998
2018-12-27 156.149994
2018-12-28 156.229996
2018-12-31 156.229996
[2346 rows x 1 columns]
>>> aapl['mean_5d'] = aapl['AAPL'].rolling(5).mean()
>>> aapl
AAPL mean_5d
Date
2010-01-04 30.572857 NaN
2010-01-05 30.625713 NaN
2010-01-06 30.138571 NaN
2010-01-07 30.082857 NaN
2010-01-08 30.282858 30.340571
... ...
2018-12-25 152.000000 153.456000
2018-12-26 157.169998 152.712000
2018-12-27 156.149994 152.575998
2018-12-28 156.229996 153.675998
2018-12-31 156.229996 155.555997
[2346 rows x 2 columns]
>>> aapl['std_5d'] = aapl['AAPL'].rolling(5).std()
>>> aapl
AAPL mean_5d std_5d
Date
2010-01-04 30.572857 NaN NaN
2010-01-05 30.625713 NaN NaN
2010-01-06 30.138571 NaN NaN
2010-01-07 30.082857 NaN NaN
2010-01-08 30.282858 30.340571 0.247898
... ... ...
2018-12-25 152.000000 153.456000 5.479579
2018-12-26 157.169998 152.712000 4.355022
2018-12-27 156.149994 152.575998 4.202209
2018-12-28 156.229996 153.675998 4.316487
2018-12-31 156.229996 155.555997 2.031717
[2346 rows x 3 columns]
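To get closer to the per-player, per-surface features in the question, rolling can be combined with groupby. What follows is only a minimal sketch under some assumptions: the sample rows are made up, the column names PLAYER1_SERVE_PCT, LAST5MATCHES_P1_SERVE_PCT and LAST90DAYS_P1_SERVE_PCT are hypothetical stand-ins for the question's columns, both windows exclude the current match, and only matches where the player appears as PLAYER1 are counted (covering both roles would need the data reshaped to one row per player per match first).

```python
import pandas as pd

# Made-up sample rows in the question's layout (hypothetical values).
matches = pd.DataFrame({
    "MATCH_DATE": pd.to_datetime(
        ["2020-08-01", "2020-10-01", "2020-11-01", "2020-11-20", "2020-12-09"]
    ),
    "PLAYER1": ["Mike", "Mike", "Mike", "Mike", "Mike"],
    "PLAYER2": ["Jim", "Bob", "Jim", "Bob", "Jim"],
    "SURFACE": ["Clay", "Clay", "Hard", "Clay", "Clay"],
    "PLAYER1_SERVE_PCT": [60.0, 62.0, 55.0, 64.0, 65.0],
})

# Rolling windows need chronological order; a unique integer index
# lets us align the results back onto the original rows.
matches = matches.sort_values("MATCH_DATE").reset_index(drop=True)
keys = ["PLAYER1", "SURFACE"]

# Count-based window: mean of the previous 5 matches in the same group.
# shift(1) drops the current match so the feature only uses the past.
matches["LAST5MATCHES_P1_SERVE_PCT"] = (
    matches.groupby(keys)["PLAYER1_SERVE_PCT"]
    .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
)

# Time-based window: mean over the 90 days before each match.
# An offset window such as "90D" needs a datetime index, and
# closed="left" excludes the current match from its own window.
parts = []
for _, group in matches.groupby(keys):
    by_date = group.set_index("MATCH_DATE")["PLAYER1_SERVE_PCT"]
    rolled = by_date.rolling("90D", closed="left").mean()
    # Put the values back on the original row labels of this group.
    parts.append(pd.Series(rolled.to_numpy(), index=group.index))
matches["LAST90DAYS_P1_SERVE_PCT"] = pd.concat(parts)

print(matches)
```

In this sample, the last row (Mike on clay, 09DEC2020) averages his three earlier clay matches for the 5-match feature and only the two clay matches from the preceding 90 days for the 90-day feature. The first match in each player and surface group has no history, so both features are NaN there. This is close to the multi-step "group by, then join" approach from SQL, except that the results are aligned on the row index instead of being joined explicitly.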
I hope this helps you write more efficient code with the pandas library!