
Rolling Max/Min/Sum for time series over last x Mins interval

I have a financial time series data.frame with microsecond precision:

timestamp                    price  volume
2017-08-29 08:00:00.345678   99.1   10
2017-08-29 08:00:00.674566   98.2   5
....
2017-08-29 16:00:00.111234   97.0   3
2017-08-29 16:00:01.445678   96.5   5

In total: around 100k records per day.

I saw a couple of functions where I can specify the width of the rolling window, e.g. k = 10. But k is expressed as a number of observations, not in minutes.

I need to calculate a running/rolling max and min of the price series and a running/rolling sum of the volume series, like this:

  1. starting with the timestamp exactly 5 minutes after the beginning of the time series,
  2. for every following timestamp, look back over a 5-minute interval and
  3. calculate the rolling statistics.
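Step 1 can be read as a row filter. The sketch below is one possible reading, assuming "exactly 5 minutes after" means the first timestamp at or after the start plus 5 minutes; has_full_lookback is a hypothetical helper name, not something from the question.

```r
# Hypothetical helper: TRUE for rows that lie at least `width_secs`
# after the first timestamp of the series (assumed reading of step 1).
has_full_lookback <- function(timestamp, width_secs = 5 * 60) {
  t <- as.numeric(timestamp)   # seconds since the epoch
  t >= min(t) + width_secs
}

# Four made-up timestamps at 0, 2, 5 and 7.5 minutes after 08:00:00
ts <- as.POSIXct("2017-08-29 08:00:00", tz = "UTC") + c(0, 120, 300, 450)
keep <- has_full_lookback(ts)
```

The rolling statistics would then be reported only for rows where keep is TRUE, while earlier rows still contribute to the look-back windows of later ones.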

How can I calculate this efficiently?

Your data

I wasn't able to capture milliseconds (but the solution should still work)

library(lubridate)
df <- data.frame(timestamp = ymd_hms("2017-08-29 08:00:00.345678", "2017-08-29 08:00:00.674566", "2017-08-29 16:00:00.111234", "2017-08-29 16:00:01.445678"),
                 price=c(99.1, 98.2, 97.0, 96.5),
                 volume=c(10,5,3,5))
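The fractional seconds are probably still stored and only hidden by the default print format. A minimal check, using base R's as.POSIXct instead of lubridate so that it runs on its own (the variable names x, frac and shown are just for illustration):

```r
# Parse one of the sample timestamps; base R reads fractional seconds here
x <- as.POSIXct("2017-08-29 08:00:00.345678", tz = "UTC")

# Fractional part of the stored value, in seconds (roughly 0.345678)
frac <- as.numeric(x) %% 1

# %OS3 asks format() for three decimal places of the seconds
shown <- format(x, "%Y-%m-%d %H:%M:%OS3")
```

Setting options(digits.secs = 6) should have a similar effect on printed data.frames. The displayed digits may be truncated rather than rounded, so the last digit can differ from the input.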

purrr and dplyr solution

library(purrr)
library(dplyr)
timeinterval <- 5*60   # 5 minutes, expressed in seconds

For each row, filter df for the observations within the time interval and save the results as a list

mdf <- map(1:nrow(df), ~df[df$timestamp >= df[.x,]$timestamp & df$timestamp < df[.x,]$timestamp+timeinterval,])

Summarise each data.frame in the list

statdf <- map_df(mdf, ~.x %>% 
                          summarise(timestamp = head(timestamp,1),
                                    max.price = max(price), 
                                    max.volume = max(volume),
                                    sum.price = sum(price),
                                    sum.volume = sum(volume),
                                    min.price = min(price), 
                                    min.volume = min(volume)))

Output

            timestamp max.price max.volume sum.price sum.volume min.price min.volume
1 2017-08-29 08:00:00      99.1         10     197.3         15      98.2          5
2 2017-08-29 08:00:00      98.2          5      98.2          5      98.2          5
3 2017-08-29 16:00:00      97.0          5     193.5          8      96.5          3
4 2017-08-29 16:00:01      96.5          5      96.5          5      96.5          5

As I was looking for a backward calculation (start with a timestamp and look 5 minutes back), I slightly modified the great solution by @CPak as follows:

mdf <- map(1:nrow(df), ~df[df$timestamp <= df[.x,]$timestamp & df$timestamp > df[.x,]$timestamp - timeinterval,])

statdf <- map_df(mdf, ~.x %>% 
                      summarise(timestamp_to = tail(timestamp,1),
                                timestamp_from = head(timestamp,1),
                                max.price = max(price), 
                                min.price = min(price),
                                sum.volume = sum(volume),
                                records = n()))

In addition, I added records = n() to see how many records were used in each interval.

One caveat: on a dataset with 100K+ records, the code takes 10 minutes for mdf and another 6 minutes for statdf.

Any ideas on how to optimize it? Thank you!
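One possible direction, sketched in base R rather than purrr/dplyr: most of the time likely goes into subsetting the whole data.frame once per row. If the rows are sorted by timestamp, findInterval can locate each window's first and last row directly, a cumulative sum gives the rolling volume, and max/min only scan the rows inside each window. The function name rolling_lookback and its arguments are made up for this sketch, and it has not been timed on the 100K+ dataset.

```r
# Hypothetical helper, not part of the original answer.
# Assumes the rows are sorted by timestamp in ascending order.
rolling_lookback <- function(timestamp, price, volume, width_secs = 5 * 60) {
  t <- as.numeric(timestamp)        # seconds since the epoch, fractions kept
  stopifnot(!is.unsorted(t))

  # The window for row i is (t[i] - width_secs, t[i]], matching the
  # "<= timestamp" and "> timestamp - timeinterval" filter.
  first <- findInterval(t - width_secs, t) + 1L  # first row after the lower bound
  last  <- findInterval(t, t)                    # last row with timestamp <= t[i]

  # Rolling sum via prefix sums: cs[k + 1] holds sum(volume[1:k])
  cs <- c(0, cumsum(volume))
  sum_volume <- cs[last + 1L] - cs[first]

  # Max and min scan only the rows inside each window
  idx <- seq_along(t)
  max_price <- vapply(idx, function(i) max(price[first[i]:last[i]]), numeric(1))
  min_price <- vapply(idx, function(i) min(price[first[i]:last[i]]), numeric(1))

  data.frame(timestamp_to   = timestamp[last],
             timestamp_from = timestamp[first],
             max.price      = max_price,
             min.price      = min_price,
             sum.volume     = sum_volume,
             records        = last - first + 1L)
}

# The four sample rows from the question
ts <- as.POSIXct(c("2017-08-29 08:00:00.345678", "2017-08-29 08:00:00.674566",
                   "2017-08-29 16:00:00.111234", "2017-08-29 16:00:01.445678"),
                 tz = "UTC")
res <- rolling_lookback(ts, price = c(99.1, 98.2, 97.0, 96.5),
                        volume = c(10, 5, 3, 5))
```

With non-integer volumes, the prefix-sum difference can pick up small floating-point errors over a long series, so summing each window directly may be safer if exact totals matter.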
