
Rolling Time-Window Features — Data Wrangling with Pandas

I have a dataset where each record holds match-level data with the following columns: MATCH_DATE | PLAYER1 | PLAYER2 | TOURNAMENT | SURFACE | PLAYER1_SERVE% | PLAYER2_SERVE%

09DEC2020 | Mike | Jim | Rome Open | Clay | 65% | 70%

I'm trying to create new columns that are rolling time-window statistics per player and surface, e.g., LAST90DAYS_PLAYER1_CLAYSERVE% and LAST5MATCHES_PLAYER1_CLAYSERVE%. Note that both fields should be computed for the same SURFACE as the record in question.

I then need to append those new columns to the original dataset to arrive at a final dataset like MATCH_DATE | PLAYER1 | PLAYER2 | TOURNAMENT | SURFACE | PLAYER1_SERVE% | PLAYER2_SERVE% | LAST90DAYS_PLAYER1_CLAYSERVE% | LAST5MATCHES_PLAYER1_CLAYSERVE%

09DEC2020 | Mike | Jim | Rome Open | Clay | 65% | 70% | 62.5% | 69.2%

Is there an elegant Pandas command that can compute this type of time-window based stats/features for each row of data? Or do I need to code a Python function from scratch with proper loops plus if/then-else logic?

I have more experience with SQL, so my inclination is to issue multiple "group by" queries to compute each new column separately and then join the resulting tables to arrive at the final dataset. That would be a multi-step process rather than an elegant single line of Pandas code with a built-in loop.

Thanks in advance!

You can use the pandas.DataFrame.rolling method; see the pandas documentation for details.

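To give you an example, suppose that you are working with an Apple stock price time series, where data is a DataFrame of daily prices indexed by date. Here is what the code would look like to compute the 5-day rolling mean (rolling(5) means the last five rows, i.e. five trading days). Of course, you can compute other metrics the same way, such as the sum or the standard deviation: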

>>> aapl = data[['AAPL']].copy()
>>> aapl
                  AAPL
Date                  
2010-01-04   30.572857
2010-01-05   30.625713
2010-01-06   30.138571
2010-01-07   30.082857
2010-01-08   30.282858
                ...
2018-12-25  152.000000
2018-12-26  157.169998
2018-12-27  156.149994
2018-12-28  156.229996
2018-12-31  156.229996
[2346 rows x 1 columns]

>>> aapl['mean_5d'] = aapl['AAPL'].rolling(5).mean()
>>> aapl
                  AAPL     mean_5d
Date                              
2010-01-04   30.572857         NaN
2010-01-05   30.625713         NaN
2010-01-06   30.138571         NaN
2010-01-07   30.082857         NaN
2010-01-08   30.282858   30.340571
                ...         ...
2018-12-25  152.000000  153.456000
2018-12-26  157.169998  152.712000
2018-12-27  156.149994  152.575998
2018-12-28  156.229996  153.675998
2018-12-31  156.229996  155.555997
[2346 rows x 2 columns]

>>> aapl['std_5d'] = aapl['AAPL'].rolling(5).std()
>>> aapl
                  AAPL     mean_5d    std_5d
Date                                        
2010-01-04   30.572857         NaN       NaN
2010-01-05   30.625713         NaN       NaN
2010-01-06   30.138571         NaN       NaN
2010-01-07   30.082857         NaN       NaN
2010-01-08   30.282858   30.340571  0.247898
                ...         ...       ...
2018-12-25  152.000000  153.456000  5.479579
2018-12-26  157.169998  152.712000  4.355022
2018-12-27  156.149994  152.575998  4.202209
2018-12-28  156.229996  153.675998  4.316487
2018-12-31  156.229996  155.555997  2.031717
[2346 rows x 3 columns]
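
Applied to the match data in the question, the same idea can be combined with groupby. What follows is only a minimal sketch under several assumptions: the sample rows are made up, the serve percentage is stored as a number in a column I have called PLAYER1_SERVE_PCT (a hypothetical name), MATCH_DATE has already been parsed to datetimes, and the current match is excluded from its own window (closed="left" for the 90-day window, shift(1) for the last-5-matches window). If you want the current match included, drop those two pieces.

```python
import pandas as pd

# Made-up sample rows in the question's layout (only the columns needed here).
matches = pd.DataFrame({
    "MATCH_DATE": pd.to_datetime([
        "2020-01-10", "2020-02-01", "2020-03-01",
        "2020-03-05", "2020-03-20", "2020-06-15",
    ]),
    "PLAYER1": ["Mike", "Mike", "Mike", "Jim", "Mike", "Mike"],
    "SURFACE": ["Clay", "Grass", "Clay", "Clay", "Clay", "Clay"],
    "PLAYER1_SERVE_PCT": [60.0, 55.0, 70.0, 72.0, 68.0, 80.0],
})

keys = ["PLAYER1", "SURFACE"]   # one rolling history per player and surface
col = "PLAYER1_SERVE_PCT"

# Time-based windows need a sorted DatetimeIndex.
df = matches.sort_values("MATCH_DATE").set_index("MATCH_DATE")

# Mean over the 90 days before each match (closed="left" excludes the match itself).
df["LAST90DAYS_PLAYER1_SERVE_PCT"] = df.groupby(keys)[col].transform(
    lambda s: s.rolling("90D", closed="left").mean()
)

# Mean over the previous 5 matches (shift(1) excludes the match itself;
# min_periods=1 returns a value even when fewer than 5 earlier matches exist).
df["LAST5MATCHES_PLAYER1_SERVE_PCT"] = df.groupby(keys)[col].transform(
    lambda s: s.shift(1).rolling(5, min_periods=1).mean()
)

result = df.reset_index()
print(result)
```

Because transform returns values aligned with the original rows, no join is needed: the new columns land directly next to the existing ones. Note that this sketch only looks at matches where the player appears as PLAYER1; to use a player's full history you would probably first reshape the data so that each match contributes one row per player.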

I hope this helps you write more efficient code with the pandas library!
