Full outer join with two-sided roll (LOCF)
How can I efficiently merge two data.tables with a full outer join, while filling missing values on both the left and right sides by rolling the last observation forward (LOCF)?
Real-world application: there are two trading rule signal tables, X and Y, whose dates do not necessarily interleave, each holding (sparse) signal values over time. The overall goal is to define a composite signal, where Signal.z = Signal.x AND Signal.y.
library(data.table)
X <- data.table(Instrument=rep("SPX",3)
, Date=as.IDate(c("2013-11-20","2013-11-22","2013-11-24"))
, Signal=c(TRUE,FALSE,TRUE), key=c("Instrument", "Date"))
Y <- data.table(Instrument=rep("SPX",3)
, Date=as.IDate(c("2013-11-21","2013-11-23","2013-11-25"))
, Signal=c(FALSE,TRUE,FALSE), key=c("Instrument", "Date"))
Desired outcome:
   Instrument       Date Signal.x Signal.y Signal.z
1:        SPX 2013-11-20     TRUE       NA       NA
2:        SPX 2013-11-21     TRUE    FALSE    FALSE
3:        SPX 2013-11-22    FALSE    FALSE    FALSE
4:        SPX 2013-11-23    FALSE     TRUE    FALSE
5:        SPX 2013-11-24     TRUE     TRUE     TRUE
6:        SPX 2013-11-25     TRUE    FALSE    FALSE
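To make the target concrete, here is a small base R sketch that reproduces the table above by hand. It is only an illustration of what "LOCF on both sides" means, not an efficient solution; the helper `locf` and the vectors `sig.x` / `sig.y` are hypothetical names introduced here, with the signals written out on the union of the six dates.

```r
# Hypothetical helper (not from the original post): carry the last
# non-NA value forward; leading NAs stay NA
locf <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of the latest non-NA seen so far
  idx[idx == 0] <- NA                        # nothing observed yet
  x[idx]
}

# Signals laid out on the union of dates from X and Y (NA = no observation)
dates <- as.Date("2013-11-20") + 0:5
sig.x <- c(TRUE, NA, FALSE, NA, TRUE, NA)
sig.y <- c(NA, FALSE, NA, TRUE, NA, FALSE)

Z <- data.frame(Date = dates,
                Signal.x = locf(sig.x),
                Signal.y = locf(sig.y))

# AND the two signals, but force NA whenever either side is still unknown
Z$Signal.z <- ifelse(is.na(Z$Signal.x) | is.na(Z$Signal.y),
                     NA, Z$Signal.x & Z$Signal.y)
print(Z)
```

The first row stays NA in Signal.y and Signal.z because Y has no observation on or before 2013-11-20.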
Something like this, perhaps:
dates = sort(c(X$Date, Y$Date))
setkey(X, Date)
setkey(Y, Date)
Z = X[J(dates), roll = T][,
Signal.y := Y[J(dates), roll = T]$Signal][,
Signal.z := as.logical(Signal * Signal.y)]
Building on this idea, here's a way of doing it for your large example data:
# assuming keys are set to Instrument, Date in both data.tables
Z = unique(setkey(rbind(setnames(X[Y, roll = T],
c("Instrument", "Date", "Signal.x", "Signal.y")),
setnames(Y[X, roll = T],
c("Instrument", "Date", "Signal.y", "Signal.x")),
use.names = TRUE),
Instrument, Date))[,
Signal.z := as.logical(Signal.x * Signal.y)]
Linked here is an excellent answer from mnel explaining how to do a full outer join in the data.table package.
The application here is straightforward, adding the wrinkle of rolling the last observation forward (via roll = TRUE in a join).
Create a data.table holding all (unique) keys in either X or Y.
## one way to do the outer join
keys <- unique(rbind(X[,key(X),with = FALSE], Y[,key(Y), with = FALSE]))
## alternate way if you have multiple data.tables to outer join
keys <- lapply(list(X,Y), function(z) z[,key(z), with = FALSE])
keys <- unique(rbindlist(keys)) ## drop keys that occur in more than one table
## this setkey is mostly cosmetic -
## determines whether the final output is sorted or not
setkeyv(keys, names(keys))
##cosmetic changing of column names to minimize confusion
setnames(X,"Signal","Signal.X")
setnames(Y,"Signal","Signal.Y")
## two joins, followed by the definition of the new column
X[Y[keys, roll = TRUE], roll = TRUE][,
Signal.Z := as.logical(Signal.X * Signal.Y)]
## this output is returned invisibly. either assign it or force print
.Last.value
#    Instrument       Date Signal.X Signal.Y Signal.Z
# 1:        SPX 2013-11-20     TRUE       NA       NA
# 2:        SPX 2013-11-21     TRUE    FALSE    FALSE
# 3:        SPX 2013-11-22    FALSE    FALSE    FALSE
# 4:        SPX 2013-11-23    FALSE     TRUE    FALSE
# 5:        SPX 2013-11-24     TRUE     TRUE     TRUE
# 6:        SPX 2013-11-25     TRUE    FALSE    FALSE
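It may help to see the two rolling joins separately. The following is a sketch on the toy data from the question, assuming the data.table package is installed; the intermediate names `sig.x` and `sig.y` are introduced here for illustration and are not part of the original answer.

```r
library(data.table)

# Toy data from the question
X <- data.table(Instrument = rep("SPX", 3),
                Date = as.IDate(c("2013-11-20", "2013-11-22", "2013-11-24")),
                Signal = c(TRUE, FALSE, TRUE),
                key = c("Instrument", "Date"))
Y <- data.table(Instrument = rep("SPX", 3),
                Date = as.IDate(c("2013-11-21", "2013-11-23", "2013-11-25")),
                Signal = c(FALSE, TRUE, FALSE),
                key = c("Instrument", "Date"))

# All unique (Instrument, Date) pairs that appear in either table
keys <- unique(rbind(X[, key(X), with = FALSE], Y[, key(Y), with = FALSE]))
setkeyv(keys, names(keys))

# For every row of `keys`, each rolling join picks the most recent
# observation on or before that date (LOCF); leading gaps remain NA
sig.x <- X[keys, roll = TRUE]$Signal
sig.y <- Y[keys, roll = TRUE]$Signal

Z <- copy(keys)[, `:=`(Signal.x = sig.x, Signal.y = sig.y)]
Z[, Signal.z := as.logical(Signal.x * Signal.y)]
print(Z)
```

Keeping the joins separate leaves X and Y untouched (no renaming by reference), at the cost of two intermediate vectors.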
The idiom as.logical(. * .) to replicate & where NA propagates is inspired by Eddi's answer.
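A quick base R check shows why the multiplication idiom is used instead of a plain `&`. The data frame `grid` below is just a scratch truth table introduced for illustration.

```r
# Every combination of TRUE / FALSE / NA for the two signals
vals <- c(TRUE, FALSE, NA)
grid <- expand.grid(x = vals, y = vals)

# Logical AND: FALSE & NA is FALSE, because the result is FALSE
# whatever the unknown value turns out to be
grid$and <- grid$x & grid$y

# Arithmetic product: any NA operand yields NA, which as.logical() keeps
grid$prod <- as.logical(grid$x * grid$y)

print(grid)

# The two columns disagree only where one side is FALSE and the other is NA
print(grid[is.na(grid$prod) & !is.na(grid$and), ])
```

So as.logical(x * y) behaves like `&` when both signals are known and returns NA as soon as either one is missing, which is the behaviour asked for in the question.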
I am going to measure times of the three available solutions (Daniel.Krizian, Blue.Magister, eddi).
For this purpose I created bigger benchmark data: large signal tables X and Y.
Benchmark data: X and Y tables
nobs <- 5000 # number of observations for each instrument
nopps <- nobs * 3 # opportunities to trade in the time window studied
ninstr <- 200 # number of instruments
set.seed(2) # set.seed(1) generates "MPM" instrument twice :)
universe <- replicate( ninstr , paste( sample( LETTERS , 3 , repl = TRUE ), collapse = "" ) )
window <- as.Date("2013-11-26") - 1:nopps + 1
frame <- CJ(Instrument=universe, Date=rep(1:nobs))
gen.sig.tbl <- function() {
  frame[, Date := as.IDate(sample(window, size=nobs, replace=FALSE)), by="Instrument"]
  setkey(frame, Instrument, Date)
  rnd.sig.sparse <- function(nobs) {
    frst <- sample(c(FALSE,TRUE), 1)
    rep(c(frst,!frst), nobs/2)
  }
  frame[, Signal := rnd.sig.sparse(nobs), by="Instrument"]
  return(copy(frame))
}
set.seed(1)
X <- gen.sig.tbl()
set.seed(2)
Y <- gen.sig.tbl()
X
         Instrument       Date Signal
      1:        AAS 1972-11-02  FALSE
      2:        AAS 1972-11-04   TRUE
      3:        AAS 1972-11-07  FALSE
      4:        AAS 1972-11-08   TRUE
      5:        AAS 1972-11-10  FALSE
     ---
 999996:        ZVH 2013-11-14  FALSE
 999997:        ZVH 2013-11-15   TRUE
 999998:        ZVH 2013-11-18  FALSE
 999999:        ZVH 2013-11-25   TRUE
1000000:        ZVH 2013-11-26  FALSE
Y
         Instrument       Date Signal
      1:        AAS 1972-11-13   TRUE
      2:        AAS 1972-11-17  FALSE
      3:        AAS 1972-11-20   TRUE
      4:        AAS 1972-11-21  FALSE
      5:        AAS 1972-11-23   TRUE
     ---
 999996:        ZVH 2013-11-16   TRUE
 999997:        ZVH 2013-11-19  FALSE
 999998:        ZVH 2013-11-23   TRUE
 999999:        ZVH 2013-11-24  FALSE
1000000:        ZVH 2013-11-25   TRUE
library(zoo) # provides na.locf
Daniel.Krizian <- function () {
  Z <- merge(X, Y, all=TRUE)[, c("Signal.x","Signal.y") := list(na.locf(Signal.x, na.rm = FALSE),
                                                                 na.locf(Signal.y, na.rm = FALSE)),
                             by=Instrument]
  Z[, Signal.z := Signal.x & Signal.y]
  # and the last line because (FALSE & NA) == FALSE, whereas NA result is desired
  Z[, Signal.z := ifelse(is.na(Signal.x) | is.na(Signal.y), NA, Signal.z)]
  return(Z)
}
Blue.Magister <- function() {
  keys <- unique(rbind(X[,key(X),with = FALSE], Y[,key(Y), with = FALSE]))
  ## this setkey is mostly cosmetic -
  ## determines whether the final output is sorted or not
  setkeyv(keys, names(keys))
  ## cosmetic changing of column names to minimize confusion
  setnames(X,"Signal","Signal.X")
  setnames(Y,"Signal","Signal.Y")
  ## two joins, followed by the definition of the new column
  Z <- X[Y[keys, roll = TRUE], roll = TRUE][,
         Signal.Z := as.logical(Signal.X * Signal.Y)]
  Z <- unique(Z)
  return(Z)
}
eddi <- function () {
  # assuming keys are set to Instrument, Date in both data.tables
  Z = unique(setkey(rbind(setnames(X[Y, roll = T],
                                   c("Instrument", "Date", "Signal.x", "Signal.y")),
                          setnames(Y[X, roll = T],
                                   c("Instrument", "Date", "Signal.y", "Signal.x")),
                          use.names = TRUE),
                    Instrument, Date))[,
        Signal.z := as.logical(Signal.x * Signal.y)]
  return(Z)
}
system.time(Z.DK <- Daniel.Krizian())
   user  system elapsed
   2.70    0.07    3.01
system.time(Z.eddi <- eddi())
   user  system elapsed
   1.14    0.03    1.84
system.time(Z.BM <- Blue.Magister())
   user  system elapsed
   3.35    0.14    3.52
setnames(X,"Signal.X", "Signal") # reset original data back after Blue.Magister() call
setnames(Y,"Signal.Y", "Signal") # reset original data back after Blue.Magister() call
setnames(Z.BM
, c("Signal.X", "Signal.Y", "Signal.Z")
, c("Signal.x", "Signal.y", "Signal.z"))
identical(Z.DK, Z.BM)
TRUE
identical(Z.DK, Z.eddi)
TRUE
My solution is as follows; if you know of a more efficient approach, let me know!
# na.locf comes from the zoo package
Z <- merge(X, Y, all=TRUE)[, c("Signal.x","Signal.y"):=list( na.locf(Signal.x, na.rm = F)
, na.locf(Signal.y, na.rm = F))
, by=Instrument]
Z[, Signal.z := Signal.x & Signal.y]
# and the last line because (FALSE & NA) == FALSE, whereas NA result is desired
Z[, Signal.z := ifelse(is.na(Signal.x) | is.na(Signal.y), NA, Signal.z)]