R data.table ordered column lookup

Question

I have an R data.table with an id column and several columns that specify ordered threshold levels and their corresponding values. For each row, I would like to look up the first level that is greater than or equal to the parameter for that id and return the corresponding value.

Here is an example data set.

DT<-data.table(id=c("Obs1","Obs2"),
    level.1=c(1,1),level.2=c(2,4),level.3=c(3,8),
    val.1=c(10,10),val.2=c(20,30),val.3=c(30,50))

DT
     id level.1 level.2 level.3 val.1 val.2 val.3
1: Obs1       1       2       3    10    20    30
2: Obs2       1       4       8    10    30    50

So if the lookup parameters are:

params<-list("Obs1"=2.5,"Obs2"=1)

the values returned should be:

c(30,10).

I would also like the number of level and value columns to be arbitrary, although the column names will follow a naming convention like the one in the example.

I can solve this using several steps, but it is very ugly and probably not very computationally efficient:

level.names<-colnames(DT)[grep("level",colnames(DT))]
val.names<-colnames(DT)[grep("val",colnames(DT))]
setkey(DT,id)

idx<-DT[,grep(TRUE,lapply(.SD,function(y)((params[[id]] <= y))))[1],
        .SDcols=level.names,by=id]

values<-ifelse(is.na(idx$V1),as.numeric(NA),DT[,get(val.names[idx[id,V1]]),by=id]$V1)

I previously solved this problem using data.frames much more cleanly, using plyr::ddply and the fact that I could use variable names for the columns in data.frame. (For brevity, I am not including that solution here.)

Any and all suggestions for improvement are welcome.

Answer 1

I'd do it using rolling joins as follows:

DT_m = melt(DT, measure=patterns("^level", "^val"), value.name=c("level", "val"))
query = list(id=c("Obs1", "Obs2"), level=c(2.5, 1))
DT_m[query, val, on=c("id", "level"), roll=-Inf]

roll=-Inf performs a NOCB (next observation carried backward) join. When a value to join by (here, the level in query) falls in a gap, the next observation is carried backward as the matching row. For example, for Obs1 the value 2.5 falls between levels 2 and 3. The matching row is therefore the one with level 3 (the next observation), and the corresponding val is 30.
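To see how the rolling join behaves at the edges, here is a small sketch that rebuilds the example table and adds a third query row (Obs1 with 5, my own addition) that lies above the top level for that id. The query is written as a data.table rather than a list only for readability, and the name res is just for illustration.

```r
library(data.table)

# Example data from the question
DT <- data.table(id = c("Obs1", "Obs2"),
                 level.1 = c(1, 1), level.2 = c(2, 4), level.3 = c(3, 8),
                 val.1 = c(10, 10), val.2 = c(20, 30), val.3 = c(30, 50))

# Long form: one row per (id, level, val) triple
DT_m <- melt(DT, measure.vars = patterns("^level", "^val"),
             value.name = c("level", "val"))

# Third row is an illustrative extra: 5 is above every level of Obs1
query <- data.table(id = c("Obs1", "Obs2", "Obs1"), level = c(2.5, 1, 5))

# NOCB rolling join: exact match if present, otherwise the next higher level
res <- DT_m[query, on = c("id", "level"), roll = -Inf]
res$val
```

With the default rollends, a value above the highest level has no next observation to carry backward, so that row should come back with val equal to NA, while 2.5 and 1 give 30 and 10 as in the question.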

Answer 2

Here's one way:

mDT = melt(DT, measure.var = patterns("level","val"), value.name = c("level","val"))
setkey(mDT, id)

#      id variable level val
# 1: Obs1        1     1  10
# 2: Obs1        2     2  20
# 3: Obs1        3     3  30
# 4: Obs2        1     1  10
# 5: Obs2        2     4  30
# 6: Obs2        3     8  50

params2 <- list(id = c("Obs1","Obs2"), v=c(2.5,1)) 
mDT[params2,{
  i = findInterval(v, level, rightmost.closed=TRUE)
  val[ i + (v != level[i]) ]
}, by=.EACHI]

#      id V1
# 1: Obs1 30
# 2: Obs2 10
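To unpack the index arithmetic, here is a base R sketch of the same expression applied to the levels of Obs1. The helper name lookup_val is hypothetical, and the guard for values below the lowest level is my own addition (the expression above does not handle that case explicitly); it assumes that such a value should map to the first level, which matches the "first level greater than or equal to" rule in the question.

```r
level <- c(1, 2, 3)    # ordered thresholds for Obs1
val   <- c(10, 20, 30) # corresponding values

# Hypothetical helper: scalar v, sorted level, val of the same length
lookup_val <- function(v, level, val) {
  # index of the last level that is <= v (0 if v is below all levels)
  i <- findInterval(v, level, rightmost.closed = TRUE)
  # assumption: below the lowest level, the first level already qualifies
  if (i == 0L) return(val[1L])
  # exact hit: stay at i; otherwise step up to the next level
  # (stepping past the last level indexes out of range and yields NA)
  val[i + (v != level[i])]
}

out <- sapply(c(2.5, 2, 3, 5, 0.5), lookup_val, level = level, val = val)
out
# 30 20 30 NA 10
```

The comparison v != level[i] is a logical that is coerced to 0 or 1 when added to i, which is what moves the index up by one when v falls strictly between two levels.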

If v is set above the top level for an id, NA will be returned:

params3 <- list(id = c("Obs1","Obs2"), v=c(5, 1)) 
mDT[params3, {i = findInterval(v, level, rightmost.closed=TRUE); val[ i + (v != level[i])]}, by=.EACHI]

#      id V1
# 1: Obs1 NA
# 2: Obs2 10

Comment. I think it's better to keep data in long/melted form than to play games with column names.

If you want to enter the parameters as key-value pairs, stack and setNames are helpful:

p0      = list(Obs1 = 2.5, Obs2 = 1)
params0 = setNames(stack(p0), c("v","id"))
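As a quick check of what that conversion produces, here is a base R sketch using the parameter values from the question. stack returns a data.frame with columns values and ind, in that order, which is why the new names are given as v first and id second. The as.character step is optional and my own addition: ind is a factor, and converting it keeps the id column the same type as in DT.

```r
# Key-value pairs as in the question
p0 <- list(Obs1 = 2.5, Obs2 = 1)

# stack() gives columns "values" and "ind"; rename them to v and id
params0 <- setNames(stack(p0), c("v", "id"))

# Optional: ind is a factor, so convert it to match a character id column
params0$id <- as.character(params0$id)

params0
#     v   id
# 1 2.5 Obs1
# 2 1.0 Obs2
```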

solution1
5 ACCEPTED 2015-11-10 02:29:54

solution2
2 2015-11-10 02:07:16
