Fastest way of matching observations within time difference

I'm calculating price differences between trades that are a specific time apart (say 60 seconds). I need to do this for several assets and many trades. However, I could not figure out a way to do it without an endless for-loop.

Let's create some sample prices:

library(birk)
library(tictoc)
library(dplyr)

initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-10-28 17:00:00',tz='GMT')

PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)

ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))

My main function is the following: for each trade (in the TradePrice column) I need to find the closest trade that falls in the 60-second interval.

calc.spread <- function(df,c=60){
  n<-length(df$PriorityDateTime)
  difft <- dspread <- spread <- rep(0,n)
  TimeF <- as.POSIXct(NA)
  for (k in 1:n){
    diffs <- as.POSIXct(df$PriorityDateTime) - as.POSIXct(df$PriorityDateTime[k])
    idx <- which.closest(diffs,x=c)  
    TimeF[k]<- as.POSIXct(df$PriorityDateTime[idx])
    difft[k] <- difftime(time1 = TimeF[k],time2 = df$PriorityDateTime[k], units = 'sec')
    dspread[k] <- abs(df$TradePrice[k] - df$TradePrice[idx])
    spread[k] <- 2*abs(log(df$TradePrice[k]) - log(df$TradePrice[idx]))

  }

  df <- data.frame(spread,dspread,difft,TimeF,PriorityDateTime=df$PriorityDateTime,id=df$id)
}

The function which.closest is just a wrapper for which.min(abs(vec - x)). As I have a data frame with multiple assets, I run:

c=60
spreads <- ndf %>% group_by(InstrumentSymbol) %>% do(calc.spread(.,c=c))
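For readers without the birk package, here is a minimal sketch of what which.closest does according to the description above. The name which_closest and the toy offsets are made up for illustration; this is not the package's actual source.

```r
# Hypothetical stand-in for birk::which.closest, following the description
# above: index of the element of `vec` nearest to the target `x`.
which_closest <- function(vec, x) {
  which.min(abs(vec - x))
}

# Time offsets (in seconds) of five trades relative to a reference trade.
offsets <- c(-60, -30, 0, 30, 60)

# The trade closest to +60 s is the fifth one.
idx_plus60 <- which_closest(offsets, 60)
# The trade closest to 0 s is the reference trade itself (third element).
idx_zero <- which_closest(offsets, 0)
```

Inside calc.spread this lookup runs once per row over the whole group, so the work grows with the square of the group size.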

The problem is that I need to run this on data frames with 3 million rows. I have searched the forum but couldn't find a way to make this code faster. ddply is slightly slower than dplyr.

Are there any suggestions?

You might have made a mistake: you are not looking for the minimum difference among trades within 60 seconds, as described, but for the trade that took place as close as possible to 60 seconds away:

idx <- which.closest(diffs,x=c)

With this, a trade that took place 1 second ago would be discarded in favor of a trade that happened closer to 60 seconds away; I don't think that is what you want. You probably want the lowest price difference among all trades within 60 seconds, which can be done with:

res$idx[i] <<-  which.min(pricediff)[1]
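To make the distinction concrete, here is a small sketch on made-up numbers (the times and prices are purely illustrative, and time is kept as plain seconds for brevity). It contrasts the "closest to 60 seconds" rule with the "smallest price difference within 60 seconds" rule.

```r
# Toy data: times (seconds) and prices of four trades of one asset.
# All values here are made up for illustration.
times  <- c(0, 1, 45, 62)
prices <- c(100, 103, 100.5, 101)

i <- 1                       # reference trade
diffs <- times - times[i]    # signed time offsets from trade i

# (a) Rule in the question: the trade whose offset is closest to +60 s.
idx_closest_to_60 <- which.min(abs(diffs - 60))

# (b) Rule proposed here: among *other* trades within 60 s, the one with
#     the smallest absolute price difference.
eligible  <- abs(diffs) <= 60 & seq_along(times) != i
pricediff <- ifelse(eligible, abs(prices[i] - prices), Inf)
idx_min_pricediff <- which.min(pricediff)
```

Rule (a) picks the trade at 62 s, which lies outside the 60-second window, while rule (b) picks the trade at 45 s.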

See the code below:

library(lubridate)
library(dplyr)
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.frame(stringsAsFactors = FALSE)
res$dspread <- res$idx <- res$spread <- NA
# for each trade i, compare it with the other trades of the same symbol within 60 secs
sapply(1:nrow(res), function(i){
  # units must be passed by name: the third positional argument of difftime is tz
  within60 <- abs(difftime(ndf$datetime[i], ndf$datetime, units = "secs")) <= 60
  samesymbol <- res$InstrumentSymbol[i] == res$InstrumentSymbol
  isdifferenttrade <- 1:nrow(res) != i
  # price difference to eligible trades, Inf for all others
  pricediff <- ifelse(within60 & samesymbol & isdifferenttrade, abs(res$TradePrice[i] - res$TradePrice), Inf)

  res$dspread[i] <<- min(pricediff)
  res$idx[i] <<- which.min(pricediff)[1] # in case several elements have the same price
  res$spread[i] <<- 2 * abs(log(res$TradePrice[i]) - log(res$TradePrice[res$idx[i]]))
})
head(res)

I used sapply, which is similar to (and can be even slower than) a for loop. If this is any faster on your real data, it is because I did the operations in fewer steps.

Let me know how it goes. Otherwise you can try the same in a for loop, or we'd have to try data.table, which I am less familiar with. These operations are generally time-consuming, of course, because the conditions have to be evaluated for each row of data.

     PriorityDateTime TradePrice InstrumentSymbol id            datetime    spread idx dspread
1 2018-10-27 10:00:00          1           asset1  1 2018-10-27 10:00:00 2.1972246   3       2
2 2018-10-27 10:00:30          2           asset2  2 2018-10-27 10:00:30 1.3862944   4       2
3 2018-10-27 10:01:00          3           asset1  3 2018-10-27 10:01:00 2.1972246   1       2
4 2018-10-27 10:01:30          4           asset2  4 2018-10-27 10:01:30 1.3862944   2       2
5 2018-10-27 10:02:00          5           asset1  5 2018-10-27 10:02:00 1.0216512   3       2
6 2018-10-27 10:02:30          6           asset2  6 2018-10-27 10:02:30 0.8109302   4       2
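Since every row is compared against all other rows above, the cost grows quadratically. One possible way to cut this down in base R, sketched here under the assumption that times can be treated as numeric seconds, is to sort each asset by time and use findInterval to locate the start and end of each trade's 60-second window, so that each iteration only touches the trades inside the window. The function name min_pricediff_window and the small data frame are hypothetical; this is an untuned illustration, not a benchmarked solution.

```r
# Sketch: for each trade, the smallest absolute price difference to another
# trade within `window` seconds. `time` is assumed numeric (seconds), e.g.
# as.numeric() of a POSIXct column; call once per asset.
min_pricediff_window <- function(time, price, window = 60) {
  ord <- order(time)
  t_sorted <- time[ord]
  p_sorted <- price[ord]
  n <- length(t_sorted)
  # First and last sorted positions inside [t - window, t + window].
  lo <- findInterval(t_sorted - window, t_sorted, left.open = TRUE) + 1L
  hi <- findInterval(t_sorted + window, t_sorted)
  out <- rep(NA_real_, n)
  for (k in seq_len(n)) {
    js <- setdiff(lo[k]:hi[k], k)          # window members other than k
    if (length(js) > 0) out[k] <- min(abs(p_sorted[k] - p_sorted[js]))
  }
  res <- rep(NA_real_, n)
  res[ord] <- out                           # back to the input order
  res
}

# Small made-up example mirroring the structure of ndf.
ndf_small <- data.frame(
  InstrumentSymbol = rep(c("asset1", "asset2"), times = 3),
  secs = seq(0, by = 30, length.out = 6),
  TradePrice = 1:6
)
ndf_small$dspread <- NA_real_
for (rows in split(seq_len(nrow(ndf_small)), ndf_small$InstrumentSymbol)) {
  ndf_small$dspread[rows] <- min_pricediff_window(
    ndf_small$secs[rows], ndf_small$TradePrice[rows]
  )
}
```

Trades with no neighbour inside the window get NA rather than Inf. The loop is still there, but its cost depends on how many trades fall inside each window, not on the total number of rows.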

Being quite unsatisfied with my own previous answer, I asked here for help, and it turns out there is at least one way in data.table that is clearly faster. I also asked a dplyr-related question here.

library(data.table)
library(lubridate)
library(dplyr)

s <- Sys.time()
# same sample data as before, but over two months
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec');length(PriorityDateTime)
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.table()
res2 <- setDT(res)
# add the window bounds and a row index, then do a non-equi self-join
# on symbol and time window and keep the closest price per id
res2 <- res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
  res2,  on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
    idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
      , .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
        res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]
res2[]
e <- Sys.time()
e - s
# Time difference of 1.23701 mins

You can then apply your calc.spread function directly to the minpricewithin60 column.
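As a rough sketch of that last step, the two formulas from calc.spread can be applied to whole columns at once. The data frame below is a made-up stand-in for the joined result; only the column names TradePrice and minpricewithin60 come from the code above.

```r
# Toy stand-in for the joined result: each trade's own price and the
# matched price from the 60-second window (values are made up).
matched <- data.frame(
  id = 1:3,
  TradePrice = c(1, 3, 5),
  minpricewithin60 = c(3, 1, 3)
)

# Same formulas as in calc.spread, vectorised over the columns.
matched$dspread <- abs(matched$TradePrice - matched$minpricewithin60)
matched$spread  <- 2 * abs(log(matched$TradePrice) - log(matched$minpricewithin60))
```

No loop is needed here because the matching has already been done by the join.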
