Fastest way of matching observations within time difference
I'm calculating price differences between trades that are a specific time apart (say, 60 seconds). I need to do this for several assets and many trades. However, I could not figure out a way to do it without a very slow for-loop.
Let's create some example prices:
library(birk)
library(tictoc)
library(dplyr)
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-10-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
My main function is the following. For each trade (in the TradePrice column), I need to find the closest trade that falls within the 60-second interval.
calc.spread <- function(df,c=60){
n<-length(df$PriorityDateTime)
difft <- dspread <- spread <- rep(0,n)
TimeF <- as.POSIXct(NA)
for (k in 1:n){
diffs <- as.POSIXct(df$PriorityDateTime) - as.POSIXct(df$PriorityDateTime[k])
idx <- which.closest(diffs,x=c)
TimeF[k]<- as.POSIXct(df$PriorityDateTime[idx])
difft[k] <- difftime(time1 = TimeF[k],time2 = df$PriorityDateTime[k], units = 'sec')
dspread[k] <- abs(df$TradePrice[k] - df$TradePrice[idx])
spread[k] <- 2*abs(log(df$TradePrice[k]) - log(df$TradePrice[idx]))
}
df <- data.frame(spread,dspread,difft,TimeF,PriorityDateTime=df$PriorityDateTime,id=df$id)
}
The function which.closest is just a wrapper for which.min(abs(vec - x)). As I have a data frame with multiple assets, I run:
c=60
spreads <- ndf %>% group_by(InstrumentSymbol) %>% do(calc.spread(.,c=c))
The problem is that I need to run this on data frames with 3 million rows. I have searched the forum but couldn't find a way to run this code faster. ddply is a little slower than dplyr.
Is there any suggestion?
You might have made a mistake: you are not looking for the minimum difference within 60 seconds, as described. Instead, this line looks for the trade whose time offset is as close as possible to exactly 60 seconds away:
idx <- which.closest(diffs,x=c)
With this, a trade that took place 1 second ago would be discarded in favor of a trade that happened closer to 60 seconds away, and I don't think that is what you want. You probably want the lowest price difference among all trades within 60 seconds, which can be done with:
res$idx[i] <<- which.min(pricediff)[1]
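To make the difference between the two selection rules concrete, here is a small sketch on made-up numbers. The vectors secs_away and price_diff are hypothetical and are not taken from your data:

```r
# Hypothetical offsets (in seconds) of four other trades relative to the
# current trade, and their absolute price differences to it.
secs_away  <- c(-1, 20, 58, 300)
price_diff <- c(0.5, 0.1, 2.0, 0.05)

# Rule used in the question: pick the trade whose offset is closest to +60 seconds.
closest_to_60 <- which.min(abs(secs_away - 60))

# Rule suggested in this answer: among trades at most 60 seconds away,
# pick the one with the smallest price difference.
masked <- ifelse(abs(secs_away) <= 60, price_diff, Inf)
best_within_60 <- which.min(masked)

closest_to_60   # 3: the trade 58 seconds away, despite its large price difference
best_within_60  # 2: the trade 20 seconds away, smallest price difference in the window
```

The trade 300 seconds away has the smallest price difference overall, but it is masked out because it lies outside the 60-second window.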
See the code below:
library(lubridate)
library(dplyr)
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.frame(stringsAsFactors = F)
res$dspread <- res$idx <- res$spread <- NA
sapply(1:nrow(res),function(i){
within60 <- abs(difftime(ndf$datetime[i],ndf$datetime,units="secs"))<=60 # name the units argument; the third positional argument of difftime is tz
samesymbol <- res$InstrumentSymbol[i]==res$InstrumentSymbol
isdifferenttrade <- 1:nrow(res)!=i
pricediff <- ifelse(within60&samesymbol&isdifferenttrade,abs(res$TradePrice[i]-res$TradePrice), Inf)
res$dspread[i] <<- min(pricediff)
res$idx[i] <<- which.min(pricediff)[1] #in case several elements have same price
res$spread[i] <<- 2*abs(log(res$TradePrice[i])-log(res$TradePrice[res$idx[i]]))
} )
head(res)
What I used was an apply-style function (sapply), which is similar to (and can be even slower than) a for loop. If this is any faster on your real data, it is because I did the operations in a way that needs fewer steps.
Let me know how it goes; otherwise you can try the same thing in a for loop, or we would have to try data.table, which I am less familiar with. These operations are generally time-consuming, of course, because you need to define conditions based on each row of data.
     PriorityDateTime TradePrice InstrumentSymbol id            datetime    spread idx dspread
1 2018-10-27 10:00:00          1           asset1  1 2018-10-27 10:00:00 2.1972246   3       2
2 2018-10-27 10:00:30          2           asset2  2 2018-10-27 10:00:30 1.3862944   4       2
3 2018-10-27 10:01:00          3           asset1  3 2018-10-27 10:01:00 2.1972246   1       2
4 2018-10-27 10:01:30          4           asset2  4 2018-10-27 10:01:30 1.3862944   2       2
5 2018-10-27 10:02:00          5           asset1  5 2018-10-27 10:02:00 1.0216512   3       2
6 2018-10-27 10:02:30          6           asset2  6 2018-10-27 10:02:30 0.8109302   4       2
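One possible way to reduce the work done per row, sketched here and not benchmarked, is to look only at the trades inside each 60-second window instead of comparing every row against all rows. This assumes the trades of a single asset are already sorted by time. The helper name closest_within is hypothetical; findInterval is base R:

```r
# Hypothetical helper: for ONE asset, with trades sorted by time, return for each
# trade the position of the other trade within `window` seconds whose price is closest.
closest_within <- function(time, price, window = 60) {
  t <- as.numeric(time)              # seconds (works for POSIXct or plain numbers)
  stopifnot(!is.unsorted(t))         # assumption: input is sorted by time
  n <- length(t)
  # First position with t >= t[i] - window
  lo <- findInterval(t - window, t, left.open = TRUE) + 1L
  # Last position with t <= t[i] + window
  hi <- findInterval(t + window, t)
  idx <- rep(NA_integer_, n)
  for (i in seq_len(n)) {
    cand <- setdiff(lo[i]:hi[i], i)  # trades in the window, excluding the trade itself
    if (length(cand) > 0L) {
      idx[i] <- cand[which.min(abs(price[cand] - price[i]))]
    }
  }
  idx
}

# The first three trades of asset1 from the example data, as seconds from the start
closest_within(time = c(0, 60, 120), price = c(1, 3, 5))
# 2 1 2
```

There is still a loop, but each iteration only touches the handful of trades inside its window. To use it on the full data you would split by InstrumentSymbol first and map the returned positions back to row ids.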
Being quite unsatisfied with my own previous answer, I asked here for help, and it turns out there is at least one way in data.table that is clearly faster. I also asked a dplyr-related question here.
library(data.table) # lubridate and dplyr are loaded as in the previous answer
s <- Sys.time()
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec');length(PriorityDateTime)
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.table()
res2 <- setDT(res)
res2 <- res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
res2, on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
, .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]
res2[]
e <- Sys.time()
e - s
# Time difference of 1.23701 mins
You can then apply your calc.spread function directly to the minpricewithin60 column.
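As written, calc.spread does its own matching inside the loop, so in practice this means applying the same spread formulas to the matched price column. A minimal sketch, using a small hand-made stand-in for the joined result (the values are illustrative only):

```r
# Toy stand-in for the joined result: each trade's price and the matched
# price found within 60 seconds (illustrative values, not real output).
matched <- data.frame(
  id               = 1:3,
  TradePrice       = c(1, 3, 5),
  minpricewithin60 = c(3, 1, 3)
)

# Same formulas as in calc.spread, but vectorized over the whole column
matched$dspread <- abs(matched$TradePrice - matched$minpricewithin60)
matched$spread  <- 2 * abs(log(matched$TradePrice) - log(matched$minpricewithin60))

matched
```

Because both formulas are vectorized, no loop is needed once the matching prices sit in a column.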