
Rolling Time-Window Features: Data Wrangling with Pandas

I have a dataset where each record contains match-level data such as:

MATCH_DATE | PLAYER1 | PLAYER2 | TOURNAMENT | SURFACE | PLAYER1_SERVE% | PLAYER2_SERVE%
09DEC2020 | Mike | Jim | Rome Open | Clay | 65% | 70%

I'm trying to create new columns based on rolling time windows per player and surface, e.g., LAST90DAYS_PLAYER1_CLAYSERVE% and LAST5MATCHES_PLAYER1_CLAYSERVE%. Note that both fields should be computed for the same SURFACE as the one specified in the record in question.

I then need to append those new columns to the original dataset to arrive at a final dataset like:

DATE | PLAYER1 | PLAYER2 | TOURNAMENT | SURFACE | PLAYER1_SERVE% | PLAYER2_SERVE% | LAST90DAYS_PLAYER1_CLAYSERVE% | LAST5MATCHES_PLAYER1_CLAYSERVE%
09DEC2020 | Mike | Jim | Rome Open | Clay | 65% | 70% | 62.5% | 69.2%

Is there an elegant Pandas command that can compute this type of time-window based stats/features for each row of data? Or do I need to code a Python function from scratch with explicit loops plus if/then/else logic?

I have more experience with SQL, so my inclination is to issue multiple "group by" queries to compute each new column separately and then join the resulting tables to arrive at the final dataset. That would be a multi-step process rather than an elegant single line of Pandas code with a built-in loop.

Thanks in advance!

You can use the pandas.DataFrame.rolling method; see the pandas documentation for details.
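
As a minimal sketch of the two kinds of window that rolling accepts (the toy series below is made up for illustration): an integer window counts rows, while an offset string such as "90D" counts calendar time and needs a sorted datetime index.

```python
import pandas as pd

# Hypothetical serve percentages for one player on one surface.
s = pd.Series(
    [60.0, 70.0, 65.0, 75.0],
    index=pd.to_datetime(["2020-01-10", "2020-03-01", "2020-08-15", "2020-09-01"]),
)

# Integer window: mean over the last 2 rows, however far apart they are in time.
# The first row has no full window, so it is NaN.
last_2_rows = s.rolling(2).mean()

# Offset window: mean over the rows dated within the 90 days up to and
# including the current row; one row is enough to produce a value.
last_90_days = s.rolling("90D").mean()

print(pd.DataFrame({"value": s, "last_2_rows": last_2_rows, "last_90_days": last_90_days}))
```

The integer form corresponds to a "last N matches" feature and the offset form to a "last N days" feature.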

As an example, suppose you are working with a time series of Apple stock prices. The code below computes a 5-day rolling mean (rolling(5) spans the last five rows, which here are trading days). You can compute other rolling statistics the same way, such as the sum or the standard deviation:

>>> # `data` is assumed to be a DataFrame of daily prices indexed by Date, with an 'AAPL' column
>>> aapl = data[['AAPL']].copy()
>>> aapl
                  AAPL
Date                  
2010-01-04   30.572857
2010-01-05   30.625713
2010-01-06   30.138571
2010-01-07   30.082857
2010-01-08   30.282858
                ...
2018-12-25  152.000000
2018-12-26  157.169998
2018-12-27  156.149994
2018-12-28  156.229996
2018-12-31  156.229996
[2346 rows x 1 columns]

>>> aapl['mean_5d'] = aapl['AAPL'].rolling(5).mean()
>>> aapl
                  AAPL     mean_5d
Date                              
2010-01-04   30.572857         NaN
2010-01-05   30.625713         NaN
2010-01-06   30.138571         NaN
2010-01-07   30.082857         NaN
2010-01-08   30.282858   30.340571
                ...         ...
2018-12-25  152.000000  153.456000
2018-12-26  157.169998  152.712000
2018-12-27  156.149994  152.575998
2018-12-28  156.229996  153.675998
2018-12-31  156.229996  155.555997
[2346 rows x 2 columns]

>>> aapl['std_5d'] = aapl['AAPL'].rolling(5).std()
>>> aapl
                  AAPL     mean_5d    std_5d
Date                                        
2010-01-04   30.572857         NaN       NaN
2010-01-05   30.625713         NaN       NaN
2010-01-06   30.138571         NaN       NaN
2010-01-07   30.082857         NaN       NaN
2010-01-08   30.282858   30.340571  0.247898
                ...         ...       ...
2018-12-25  152.000000  153.456000  5.479579
2018-12-26  157.169998  152.712000  4.355022
2018-12-27  156.149994  152.575998  4.202209
2018-12-28  156.229996  153.675998  4.316487
2018-12-31  156.229996  155.555997  2.031717
[2346 rows x 3 columns]
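
The same idea can be applied per player and surface by combining groupby with rolling. The following is only a sketch under several assumptions: the sample rows and the output column names are invented for illustration, the serve percentage is stored as a float, the current match is left out of its own window, and only matches where the player appears as PLAYER1 are considered.

```python
import pandas as pd

# Hypothetical sample rows in the shape described in the question.
df = pd.DataFrame({
    "MATCH_DATE": ["15MAR2020", "01OCT2020", "20OCT2020", "05NOV2020", "09DEC2020"],
    "PLAYER1": ["Mike", "Mike", "Mike", "Mike", "Mike"],
    "SURFACE": ["Clay", "Clay", "Clay", "Hard", "Clay"],
    "PLAYER1_SERVE%": [55.0, 60.0, 65.0, 70.0, 65.0],
})
df["MATCH_DATE"] = pd.to_datetime(df["MATCH_DATE"], format="%d%b%Y")

keys = ["PLAYER1", "SURFACE"]

# Sort by group keys, then date, so that the grouped result below comes back
# in the same row order as df and can be assigned positionally.
df = df.sort_values(keys + ["MATCH_DATE"]).reset_index(drop=True)

# Mean over the same player's earlier matches on the same surface in the
# 90 days before each match; closed="left" leaves the current match out.
df["LAST90DAYS_PLAYER1_SERVE%"] = (
    df.set_index("MATCH_DATE")
      .groupby(keys)["PLAYER1_SERVE%"]
      .rolling("90D", closed="left")
      .mean()
      .to_numpy()
)

# Mean over up to 5 previous matches in the same group; shift() moves each
# value down one row so the current match is not part of its own window.
df["LAST5MATCHES_PLAYER1_SERVE%"] = (
    df.groupby(keys)["PLAYER1_SERVE%"]
      .transform(lambda s: s.shift().rolling(5, min_periods=1).mean())
)

print(df)
```

Rows with no earlier match in the window get NaN. The positional assignment relies on the sort order matching the group order, so rows with missing PLAYER1 or SURFACE values would need to be handled first.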

I hope this helps you write more efficient code with the pandas library!
