Reindexing in R with data.table?

Here is a problem I've come across frequently of late in R with data.table.

I have an index table, say DT1. Column x holds a subset of indices. I work with a subtable of a bigger raw table via these indices; the subtable is typically indexed from 1 to N (that's column y).

Then, for example, I come across a table of index pairs in the original indexing, and I want to know the corresponding new indexing.

Here's what it looks like:

DT1 <- data.table(x=c(0,3,5),y= c(11,22,33))
DT2 <- data.table(x=c(3,3,0,0,5),x=c(0,5,0,3,5))
# > DT1
#    x y
# 1: 0 11
# 2: 3 22
# 3: 5 33

# > DT2
#    x x
# 1: 3 0
# 2: 3 5
# 3: 0 0
# 4: 0 3
# 5: 5 5

Here is a tortuous way I found:

cbind(DT1[DT2[, 1, with = FALSE], on = "x"][, 2, with = FALSE],
      DT1[DT2[, 2, with = FALSE], on = "x"][, 2, with = FALSE])
#     y  y
# 1: 22 11
# 2: 22 33
# 3: 11 11
# 4: 11 22
# 5: 33 33

A more basic way to do this with sapply gives the same result:

tab <- DT1$x  # lookup() relies on this global variable
lookup <- function(value) { DT1$y[which(tab == value)] }

colnames(DT2) <- c("x", "xx")  # give the duplicated columns distinct names

ans <- as.data.table(cbind(sapply(DT2$x, lookup), sapply(DT2$xx, lookup)))
colnames(ans) <- c("y", "y")

However, the first solution looks a bit ugly to me.

I don't like the second one because I need to assign a value to tab each time I use lookup in sapply. If I had to look things up in different tables, I would have to either create a new lookup function specific to each table or store the table in a temporary variable tab. Perhaps there is a way to do the sapply with a two-argument function, lookup <- function(tab, value) {...}? Something like the sketch below is what I have in mind.
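
A minimal sketch of that idea (extra arguments to sapply() are passed through to the function, so no global tab is needed; the exact signature here is just illustrative):

lookup <- function(value, keys, values) { values[match(value, keys)] }
sapply(DT2$x, lookup, keys = DT1$x, values = DT1$y)
# [1] 22 22 11 11 33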

I'm sure there are many other ways. I'm not sure what exactly I'm doing with the first solution; basically, the data.table syntax there amounts to (inner and outer) joins. But in the final output I want to keep the original row order of DT2, and setting column x as a key for DT2 would sort on that column, which seems to make approaches like merge ill-suited here?

I'd like to hear what the best implementation is (I'm sure there are many better ones), and also the most efficient one when dealing with very, very large tables.

The idiomatic data.table approach would be to update DT2 while joining as follows:

require(data.table) # v1.9.6
setnames(DT2, c("a", "b")) # no duplicate names!!
for (nm in names(DT2)) {
    # structure("x", names = nm) builds c(a = "x") / c(b = "x"), i.e. join
    # DT2's column <nm> to DT1's x, then add column <nm>.val by reference
    DT2[DT1, paste0(nm, ".val") := y, on = structure("x", names = nm)]
}
DT2[]
#    a b a.val b.val
# 1: 3 0    22    11
# 2: 3 5    22    33
# 3: 0 0    11    11
# 4: 0 3    11    22
# 5: 5 5    33    33
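
You can hide the loop with lapply(). A minimal sketch, assuming DT2 is still the fresh two-column table (invisible() merely suppresses the printed return values; := still updates DT2 by reference):

idx_cols <- c("a", "b")  # the columns holding the original indices
invisible(lapply(idx_cols, function(nm) {
    DT2[DT1, paste0(nm, ".val") := y, on = structure("x", names = nm)]
}))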

If DT2 were instead as follows (in long form; see DT3):

# starting from the two-column DT2, i.e. before the .val columns were added
DT3 = melt(DT2, measure = c("a", "b"), variable.name = "id", value.name = "x.val")

then you could do:

DT3[DT1, y.val := y, on = c(x.val = "x")] 

You can use y.val := i.y to be more explicit that you're referring to the y column of the data.table passed as the i argument (useful when both tables share column names).
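
For reference, assuming DT3 was built from the two-column DT2 as above, the updated DT3 would look like this:

DT3
#     id x.val y.val
#  1:  a     3    22
#  2:  a     3    22
#  3:  a     0    11
#  4:  a     0    11
#  5:  a     5    33
#  6:  b     0    11
#  7:  b     5    33
#  8:  b     0    11
#  9:  b     3    22
# 10:  b     5    33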

You can also try something like the following, using base R's match() on the original DT2 (the one with duplicated x names):

# for each column of DT2: locate each value in DT1$x, then take DT1$y there
DT2[, lapply(.SD, function(x) DT1[["y"]][match(x, DT1[["x"]])])]
#     x  x
# 1: 22 11
# 2: 22 33
# 3: 11 11
# 4: 11 22
# 5: 33 33
str(.Last.value)
# Classes ‘data.table’ and 'data.frame':    5 obs. of  2 variables:
#  $ x: num  22 22 11 11 33
#  $ x: num  11 33 11 22 33
#  - attr(*, ".internal.selfref")=<externalptr> 
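
If you want distinct names in the result rather than the duplicated x, a small variation (the names y1 and y2 are just illustrative; setnames() renames by reference):

ans <- DT2[, lapply(.SD, function(col) DT1[["y"]][match(col, DT1[["x"]])])]
setnames(ans, c("y1", "y2"))
ans
#    y1 y2
# 1: 22 11
# 2: 22 33
# 3: 11 11
# 4: 11 22
# 5: 33 33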
