Fastest way to match observations within a time difference

Question

I'm calculating price differences between trades that are a specific time difference apart (say 60 seconds). I need to do this for several assets and many trades. However, I could not figure out a way to do this without a never-ending for-loop.

Let's create some sample prices:

library(birk)
library(tictoc)
library(dplyr)

initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-10-28 17:00:00',tz='GMT')

PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec')
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)

ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))

My main function is the following. For each trade (in the TradePrice column) I need to find the closest trade that falls in the 60-second interval.

calc.spread <- function(df,c=60){
  n<-length(df$PriorityDateTime)
  difft <- dspread <- spread <- rep(0,n)
  TimeF <- as.POSIXct(NA)
  for (k in 1:n){
    diffs <- as.POSIXct(df$PriorityDateTime) - as.POSIXct(df$PriorityDateTime[k])
    idx <- which.closest(diffs,x=c)  
    TimeF[k]<- as.POSIXct(df$PriorityDateTime[idx])
    difft[k] <- difftime(time1 = TimeF[k],time2 = df$PriorityDateTime[k], units = 'sec')
    dspread[k] <- abs(df$TradePrice[k] - df$TradePrice[idx])
    spread[k] <- 2*abs(log(df$TradePrice[k]) - log(df$TradePrice[idx]))

  }

  df <- data.frame(spread,dspread,difft,TimeF,PriorityDateTime=df$PriorityDateTime,id=df$id)
}

The function which.closest is just a wrapper for which.min(abs(vec - x)). As I have a data frame with multiple assets, I run:

c=60
spreads <- ndf %>% group_by(InstrumentSymbol) %>% do(calc.spread(.,c=c))
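To make the cost of the loop concrete, here is a minimal sketch of what the wrapper does, based only on the description above. The name which_closest is a hypothetical stand-in, not the birk implementation. Each call scans the whole vector, so calling it once per trade makes the total work grow with the square of the number of rows.

```r
# Hypothetical stand-in for birk::which.closest, following the description above
which_closest <- function(vec, x) which.min(abs(vec - x))

# Time offsets (in seconds) of all trades relative to the current trade
diffs <- c(-30, 0, 30, 61, 90)

# Index of the trade whose offset is nearest to 60 seconds
closest <- which_closest(diffs, 60)
print(closest)
```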

The problem is that I need to run this on data frames with 3 million rows. I have searched the forum but couldn't find a way to make this code run faster. ddply is a little slower than dplyr.

Any suggestions?

Answer 1

You might have made a mistake, in the sense that you are not looking for the minimum difference within 60 seconds as described. Instead, you are looking for the trade that took place as close as possible to 60 seconds away:

idx <- which.closest(diffs,x=c)

With this, a trade that took place 1 second ago would be discarded in favour of a trade that happened closer to 60 seconds away. I don't think that is what you want. You probably want the lowest price difference among all trades within 60 seconds, which can be done with:

res$idx[i] <<-  which.min(pricediff)[1]
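A small illustration of how the two selection rules can disagree. The numbers are made up for the example, and the variable names are hypothetical:

```r
# Hypothetical toy data: time offsets (seconds) and prices of four other
# trades, relative to a current trade priced at 100
secs_from_now <- c(-1, 20, 58, 75)
price         <- c(100.1, 103, 107, 100.05)
current_price <- 100

# Rule in the question: the trade whose offset is nearest to +60 seconds
idx_question <- which.min(abs(secs_from_now - 60))

# Rule suggested here: among trades within 60 seconds, the smallest price difference
pricediff  <- ifelse(abs(secs_from_now) <= 60, abs(price - current_price), Inf)
idx_answer <- which.min(pricediff)

print(c(question = idx_question, answer = idx_answer))
```

The first rule picks the trade 58 seconds away, even though the trade 1 second ago has a much closer price. The trade at 75 seconds has the closest price overall but falls outside the window, so the second rule ignores it.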

See the code below: 请参见下面的代码：

library(lubridate)
library(dplyr)
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.frame(stringsAsFactors = F)
res$dspread <- res$idx <- res$spread <- NA
invisible(sapply(1:nrow(res), function(i){
  # units must be named: the third positional argument of difftime() is tz
  within60 <- abs(difftime(ndf$datetime[i], ndf$datetime, units = "secs")) <= 60
  samesymbol <- res$InstrumentSymbol[i] == res$InstrumentSymbol
  isdifferenttrade <- 1:nrow(res) != i
  # price differences for eligible trades; Inf excludes everything else
  pricediff <- ifelse(within60 & samesymbol & isdifferenttrade, abs(res$TradePrice[i] - res$TradePrice), Inf)

  res$dspread[i] <<- min(pricediff)
  res$idx[i] <<- which.min(pricediff)[1] # in case several elements have the same price
  res$spread[i] <<- 2*abs(log(res$TradePrice[i]) - log(res$TradePrice[res$idx[i]]))
}))
head(res)

What I used was sapply, which is similar to (and can be even slower than) a for loop. If this is any faster for your real data, it is because I did the operations in a way that needs fewer steps.

Let me know how it goes. Otherwise you can try the same in a for loop, or we'd have to try data.table, which I am less familiar with. These operations are generally time-consuming, of course, because the conditions have to be evaluated for each row of data.

     PriorityDateTime TradePrice InstrumentSymbol id            datetime    spread idx dspread
1 2018-10-27 10:00:00          1           asset1  1 2018-10-27 10:00:00 2.1972246   3       2
2 2018-10-27 10:00:30          2           asset2  2 2018-10-27 10:00:30 1.3862944   4       2
3 2018-10-27 10:01:00          3           asset1  3 2018-10-27 10:01:00 2.1972246   1       2
4 2018-10-27 10:01:30          4           asset2  4 2018-10-27 10:01:30 1.3862944   2       2
5 2018-10-27 10:02:00          5           asset1  5 2018-10-27 10:02:00 1.0216512   3       2
6 2018-10-27 10:02:30          6           asset2  6 2018-10-27 10:02:30 0.8109302   4       2
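If the rule from the question (the trade closest to 60 seconds later) is really what is wanted, the per-row scan can probably be avoided altogether. The following is a rough sketch of a different technique, a binary search with base R's findInterval. It is not taken from either answer. closest_to_offset is a hypothetical helper, and it assumes numeric timestamps in seconds, sorted ascending within one asset, with a non-negative offset.

```r
# Hypothetical helper (not from the answers): for each timestamp, return the
# index of the timestamp closest to (t + offset). Assumes t is numeric seconds,
# sorted ascending, and offset >= 0.
closest_to_offset <- function(t, offset = 60) {
  n <- length(t)
  target <- t + offset
  lo <- findInterval(target, t)   # last index with t[lo] <= target
  hi <- pmin(lo + 1L, n)          # next index, capped at the last element
  # Pick whichever neighbour is nearer; ties go to the earlier trade
  ifelse(abs(t[hi] - target) < abs(t[lo] - target), hi, lo)
}

# Toy example: irregular trade times in seconds for one asset
t <- c(0, 10, 55, 70, 200)
idx <- closest_to_offset(t, 60)

# Brute-force check using the same rule as which.min(abs(vec - x))
brute <- sapply(seq_along(t), function(k) which.min(abs(t - t[k] - 60)))
stopifnot(all(idx == brute))
```

For the data in the question this would be called once per asset, for example inside a grouped operation with as.numeric(PriorityDateTime) as input. The returned indices are positions within that asset's rows. With duplicated timestamps the chosen index may differ from which.min, although the matched time is the same.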

Answer 2

Being quite unsatisfied with my own previous answer, I asked for help in a separate question, and it turns out there is at least one way in data.table which is clearly faster. I also asked a dplyr-related question.

library(data.table)
library(lubridate)
library(dplyr)

s <- Sys.time()
initial.date <- as.POSIXct('2018-10-27 10:00:00',tz='GMT')
last.date <- as.POSIXct('2018-12-28 17:00:00',tz='GMT')
PriorityDateTime=seq.POSIXt(from=initial.date,to = last.date,by = '30 sec');length(PriorityDateTime)
TradePrice=seq(from=1, to=length(PriorityDateTime),by = 1)
ndf<- data.frame(PriorityDateTime,TradePrice)
ndf$InstrumentSymbol <- rep_len(x = c('asset1','asset2'),length.out = length(ndf$PriorityDateTime))
ndf$id <- seq(1:length(x = ndf$InstrumentSymbol))
ndf$datetime <- ymd_hms(ndf$PriorityDateTime)
res <- ndf %>% data.table()
res2 <- setDT(res)
res2 <- res2[, `:=` (min_60 = datetime - 60, plus_60 = datetime + 60, idx = .I)][
  res2,  on = .(InstrumentSymbol = InstrumentSymbol, datetime >= min_60, datetime <= plus_60), allow.cartesian = TRUE][
    idx != i.idx, .SD[which.min(abs(i.TradePrice - TradePrice))], by = id][
      , .(id, minpricewithin60 = i.TradePrice, index.minpricewithin60 = i.idx)][
        res, on = .(id)][, `:=` (min_60 = NULL, plus_60 = NULL, idx = NULL)]
res2[]
e <- Sys.time()
> e-s
Time difference of 1.23701 mins

You can then apply your calc.spread function directly to the minpricewithin60 column.
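As a minimal sketch of that last step, the two spread formulas from calc.spread can be applied as vectorised column operations. The small data frame below is a hypothetical stand-in for the joined result, reusing the column names from the code above:

```r
# Hypothetical stand-in for the joined result, with the column names used above
res2 <- data.frame(
  id               = 1:3,
  TradePrice       = c(1, 2, 5),
  minpricewithin60 = c(3, 4, 3)
)

# Same formulas as in calc.spread, applied to whole columns at once
res2$dspread <- abs(res2$TradePrice - res2$minpricewithin60)
res2$spread  <- 2 * abs(log(res2$TradePrice) - log(res2$minpricewithin60))

print(res2)
```

Because both expressions work on entire columns, no loop over rows is needed once the matching price is available.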

Answer timestamps:
Answer 1: 2018-10-28 10:08:07
Answer 2: 2018-10-28 20:12:27
