Rolling Max/Min/Sum for time series over last x Mins interval
I have a financial time series data.frame with microsecond precision:
timestamp price volume
2017-08-29 08:00:00.345678 99.1 10
2017-08-29 08:00:00.674566 98.2 5
....
2017-08-29 16:00:00.111234 97.0 3
2017-08-29 16:00:01.445678 96.5 5
In total: around 100k records per day.
I saw a couple of functions where I can specify the width of the rolling window, e.g. k = 10. But k is expressed as a number of observations, not minutes.
I need to calculate a running/rolling max and min of the price series and a running/rolling sum of the volume series, like this:
How can I calculate this efficiently?
I wasn't able to capture milliseconds (but the solution should still work).
library(lubridate)
df <- data.frame(timestamp = ymd_hms("2017-08-29 08:00:00.345678", "2017-08-29 08:00:00.674566", "2017-08-29 16:00:00.111234", "2017-08-29 16:00:01.445678"),
price=c(99.1, 98.2, 97.0, 96.5),
volume=c(10,5,3,5))
library(purrr)
library(dplyr)
timeinterval <- 5*60 # 5 minutes, in seconds
Filter df for observations within the time interval, save as a list
mdf <- map(1:nrow(df), ~df[df$timestamp >= df[.x,]$timestamp & df$timestamp < df[.x,]$timestamp+timeinterval,])
Summarise each data.frame in the list
statdf <- map_df(mdf, ~.x %>%
summarise(timestamp = head(timestamp,1),
max.price = max(price),
max.volume = max(volume),
sum.price = sum(price),
sum.volume = sum(volume),
min.price = min(price),
min.volume = min(volume)))
            timestamp max.price max.volume sum.price sum.volume min.price min.volume
1 2017-08-29 08:00:00      99.1         10     197.3         15      98.2          5
2 2017-08-29 08:00:00      98.2          5      98.2          5      98.2          5
3 2017-08-29 16:00:00      97.0          5     193.5          8      96.5          3
4 2017-08-29 16:00:01      96.5          5      96.5          5      96.5          5
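For reference, this kind of time-based (rather than row-count-based) window can also be expressed without building an intermediate list of data.frames, e.g. with the slider package. This is a sketch under the assumption that slider and lubridate are installed; note that .after = minutes(5) uses inclusive bounds on both ends, whereas the filter above excludes the right endpoint, so results can differ for rows landing exactly on a window boundary.

```r
library(slider)
library(lubridate)
library(dplyr)

df <- data.frame(
  timestamp = ymd_hms("2017-08-29 08:00:00.345678",
                      "2017-08-29 08:00:00.674566",
                      "2017-08-29 16:00:00.111234",
                      "2017-08-29 16:00:01.445678"),
  price  = c(99.1, 98.2, 97.0, 96.5),
  volume = c(10, 5, 3, 5)
)

# slide_index_dbl() windows by the timestamp index itself, so each row's
# window is "this row's time up to 5 minutes later", regardless of how
# many observations fall inside it.
statdf <- df %>%
  mutate(
    max.price  = slide_index_dbl(price,  timestamp, max, .after = minutes(5)),
    min.price  = slide_index_dbl(price,  timestamp, min, .after = minutes(5)),
    sum.volume = slide_index_dbl(volume, timestamp, sum, .after = minutes(5))
  )
```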
As I was looking for a backward calculation (start with a timestamp and look 5 minutes back), I slightly modified the great solution by @CPak as follows:
mdf <- map(1:nrow(df), ~df[df$timestamp <= df[.x,]$timestamp & df$timestamp > df[.x,]$timestamp - timeinterval,])
statdf <- map_df(mdf, ~.x %>%
summarise(timestamp_to = tail(timestamp,1),
timestamp_from = head(timestamp,1),
max.price = max(price),
min.price = min(price),
sum.volume = sum(volume),
records = n()))
In addition, I added records = n() to see how many records were used in each interval.
One caveat: the code takes 10 minutes for mdf and another 6 minutes for statdf on a dataset with 100K+ records.
Any ideas how to optimize it? Thank you!
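One possible optimization (a sketch, not benchmarked against the original data): the per-row filtering above scans the full data.frame for every row, which is O(n²). A data.table non-equi join can find each row's backward 5-minute window in a single indexed join and aggregate per window with by = .EACHI. Exact column naming of the join keys in the result may vary by data.table version.

```r
library(data.table)
library(lubridate)

df <- data.frame(
  timestamp = ymd_hms("2017-08-29 08:00:00.345678",
                      "2017-08-29 08:00:00.674566",
                      "2017-08-29 16:00:00.111234",
                      "2017-08-29 16:00:01.445678"),
  price  = c(99.1, 98.2, 97.0, 96.5),
  volume = c(10, 5, 3, 5)
)

dt <- as.data.table(df)
timeinterval <- 5 * 60  # 5 minutes, in seconds

# One row per original observation, holding its backward-looking window
windows <- dt[, .(from = timestamp - timeinterval, to = timestamp)]

# Non-equi join: for each window, pick the rows of dt whose timestamp
# falls in (to - 5min, to], then aggregate per window via by = .EACHI
statdf <- dt[windows,
             on = .(timestamp > from, timestamp <= to),
             .(max.price  = max(price),
               min.price  = min(price),
               sum.volume = sum(volume),
               records    = .N),
             by = .EACHI]
```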