
R data.table ordered column lookup

I have an R data.table with an id column and multiple columns specifying an ordered threshold level and a corresponding value. For each row, I would like to look up the first level that is greater than or equal to the parameter for that id and return the corresponding value.

Here is an example data set.

library(data.table)

DT<-data.table(id=c("Obs1","Obs2"),
    level.1=c(1,1),level.2=c(2,4),level.3=c(3,8),
    val.1=c(10,10),val.2=c(20,30),val.3=c(30,50))

DT
     id level.1 level.2 level.3 val.1 val.2 val.3
1: Obs1       1       2       3    10    20    30
2: Obs2       1       4       8    10    30    50

So if the lookup parameters are:

params<-list("Obs1"=2.5,"Obs2"=1) 

the values returned should be:

c(30,10).

I would also like the number of levels and values to be somewhat arbitrary, although the columns will always follow a naming convention similar to the one in the example.

I can solve this using several steps, but it is very ugly and probably not very computationally efficient:

level.names<-colnames(DT)[grep("level",colnames(DT))]
val.names<-colnames(DT)[grep("val",colnames(DT))]
setkey(DT,id)

idx<-DT[,grep(TRUE,lapply(.SD,function(y)((params[[id]] <= y))))[1],
        .SDcols=level.names,by=id]

values<-ifelse(is.na(idx$V1),as.numeric(NA),DT[,get(val.names[idx[id,V1]]),by=id]$V1)

I previously solved this problem using data.frames much more cleanly, using plyr::ddply and the fact that I could use variable names for the columns in data.frame. (For brevity, I am not including that solution here.)

Any and all suggestions for improvement are welcome.

I'd do it using rolling joins as follows:

DT_m = melt(DT, measure=patterns("^level", "^val"), value.name=c("level", "val"))
query = list(id=c("Obs1", "Obs2"), level=c(2.5, 1))
DT_m[query, val, on=c("id", "level"), roll=-Inf]

roll=-Inf performs a NOCB (next observation carried backward) join. When a value to join by (here, the level in query) falls in a gap, the next observation is carried backward as the matching row. For example, for Obs1 the value 2.5 falls between levels 2 and 3. The matching row is therefore the one with level 3 (the next observation), and the corresponding val is 30.
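To make the gap, exact-match, and out-of-range cases concrete, here is a minimal sketch of the same rolling join. The third query row (level 9 for Obs2) is a hypothetical addition for illustration and is not part of the original question.

```r
library(data.table)

DT <- data.table(id = c("Obs1", "Obs2"),
                 level.1 = c(1, 1), level.2 = c(2, 4), level.3 = c(3, 8),
                 val.1 = c(10, 10), val.2 = c(20, 30), val.3 = c(30, 50))

# Long form: one row per (id, level, val) triple.
DT_m <- melt(DT, measure.vars = patterns("^level", "^val"),
             value.name = c("level", "val"))

# Hypothetical query rows: 2.5 sits in a gap, 1 is an exact match,
# and 9 lies above every level recorded for Obs2.
query <- data.table(id = c("Obs1", "Obs2", "Obs2"), level = c(2.5, 1, 9))

# roll = -Inf carries the next observation backward within each id.
res <- DT_m[query, on = c("id", "level"), roll = -Inf]
res[, .(id, level, val)]
```

With the default rollends for a negative roll, a query above the last level has no next observation to carry backward, so val should come back as NA for that row, while the first two rows give 30 and 10.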

Here's one way:

mDT = melt(DT, measure.var = patterns("level","val"), value.name = c("level","val"))
setkey(mDT, id)

#      id variable level val
# 1: Obs1        1     1  10
# 2: Obs1        2     2  20
# 3: Obs1        3     3  30
# 4: Obs2        1     1  10
# 5: Obs2        2     4  30
# 6: Obs2        3     8  50

params2 <- list(id = c("Obs1","Obs2"), v=c(2.5,1)) 
mDT[params2,{
  i = findInterval(v, level, rightmost.closed=TRUE)
  val[ i + (v != level[i]) ]
}, by=.EACHI]

#      id V1
# 1: Obs1 30
# 2: Obs2 10
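The index arithmetic inside j can be easier to follow on plain vectors. The sketch below wraps it in a helper named first_at_or_above, which is a hypothetical name used only for illustration and does not exist in base R or data.table. It assumes level is sorted in increasing order.

```r
level <- c(1, 4, 8)    # sorted thresholds for one id (Obs2 in the example)
val   <- c(10, 30, 50)

# Hypothetical helper: value at the first level >= v.
first_at_or_above <- function(v, level, val) {
  # Index of the last threshold that is <= v (0 if v is below all of them).
  k <- findInterval(v, level, rightmost.closed = TRUE)
  # On an exact hit keep k; otherwise step forward to the next threshold.
  val[k + (v != level[k])]
}

exact <- first_at_or_above(4, level, val)    # exact match on level 4
gap   <- first_at_or_above(2.5, level, val)  # between levels 1 and 4
over  <- first_at_or_above(9, level, val)    # above the top level
under <- first_at_or_above(0.5, level, val)  # below the lowest level
```

Here exact and gap are both 30 and over is NA. One edge case worth noting: when v is below the lowest level, findInterval returns 0, level[0] is empty, and the expression yields a zero-length vector rather than the first value, so that case may need separate handling.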

If v is set above the top level for an id, NA is returned:

params3 <- list(id = c("Obs1","Obs2"), v=c(5, 1)) 
mDT[params3, {i = findInterval(v, level, rightmost.closed=TRUE); val[ i + (v != level[i])]}, by=.EACHI]

#      id V1
# 1: Obs1 NA
# 2: Obs2 10

Comment. I think it's better to keep data in long/melted form than to play games with column names.

If you want to enter the parameters as key-value pairs, stack and setNames are helpful:

p0      = list(Obs1 = 1, Obs2 = 2.5)
params0 = setNames(stack(p0), c("v","id"))
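As a sketch of how params0 might then be used, note that stack() puts the values first and the names second (as a factor), so id is no longer the first column. The example below therefore names the join column with on = "id" instead of relying on the key, and converts id to character explicitly; both steps are assumptions about one reasonable way to wire this up, not part of the original answer.

```r
library(data.table)

DT <- data.table(id = c("Obs1", "Obs2"),
                 level.1 = c(1, 1), level.2 = c(2, 4), level.3 = c(3, 8),
                 val.1 = c(10, 10), val.2 = c(20, 30), val.3 = c(30, 50))

mDT <- melt(DT, measure.vars = patterns("level", "val"),
            value.name = c("level", "val"))
setkey(mDT, id)  # sorts by id; levels stay in increasing order within each id

# Key-value parameters converted to a two-column table (v, id).
p0 <- list(Obs1 = 1, Obs2 = 2.5)
params0 <- as.data.table(setNames(stack(p0), c("v", "id")))
params0[, id := as.character(id)]  # stack() returns the names as a factor

# Same lookup as above, with the join column named explicitly.
res <- mDT[params0, {
  k <- findInterval(v, level, rightmost.closed = TRUE)
  val[k + (v != level[k])]
}, on = "id", by = .EACHI]
res
```

For these parameters, Obs1 matches level 1 exactly (value 10) and Obs2 falls between levels 1 and 4 (value 30).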
