
data.table bug: lapply on .SD reorders columns when using get(). Possible workaround?

I found a strange behavior in data.table. I would like to know whether there is a way to avoid it, or a workaround.

In my data management, I often use lapply with .SD to assign new values to columns. To assign several columns correctly, the column order of the lapply output must be preserved. I found a situation where it is not.

Here is the normal behavior:

library(data.table)
plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y","x")
plouf[,.SD,.SDcols = cols ,by = z]
plouf[,lapply(.SD,function(x){x}),.SDcols = cols ,by = z]
plouf[,lapply(.SD[x == 1],function(x){x}),.SDcols = cols ,by = z]

All these lines give:

   z y x
1: 3 2 1

which is what I need, for example to reassign to c("y","x"). But if I do:

plouf[,lapply(.SD[get("x") == 1],function(x){x}),.SDcols = c("y","x"),by = z]

   z x y
1: 3 1 2

Here the order of x and y changed for no apparent reason, when it should yield the same result as the last working example. As a consequence, the wrong values are assigned to c("y","x") if I assign the output of lapply to a vector of columns. It seems that using get in the i part of .SD triggers this bug.

Example of the effect of this on assignment:

plouf[, c(cols ) := lapply(.SD[get("x") == 1],function(x){x}),
      .SDcols = cols ,by = z][]
#    x y z
# 1: 2 1 3
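
One possible defensive fix, offered as a sketch rather than an official recommendation: index the list returned by lapply by name with [cols], so that the assignment no longer depends on the order in which .SD presents its columns. The name res is only illustrative.

```r
library(data.table)

plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y", "x")

# lapply() returns a named list; [cols] puts its elements back in the
# order of cols, whatever order .SD used internally.
res <- plouf[, lapply(.SD[get("x") == 1], function(v) v)[cols],
             .SDcols = cols, by = z]

# The same idea applied to the assignment form.
plouf[, (cols) := lapply(.SD[get("x") == 1], function(v) v)[cols],
      .SDcols = cols, by = z]
```

Because the reordering is done by name, this should give the intended values whether or not the installed data.table version still shows the reordering.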

Does anyone have a workaround? The code I am actually using looks more like this:

 plouf[, c(cols) := lapply(.SD[get("x") >= 1 & get("x") <= 3], function(x) mean(x)),
       .SDcols = cols, by = z]

The issue on GitHub: https://github.com/Rdatatable/data.table/issues/4089

Instead of subsetting .SD, you could do the subsetting inside your lapply function. If the logical vector used for subsetting is passed as a third argument to lapply, it isn't re-evaluated at each lapply pass.

Note: I changed the function to multiply by 10, since otherwise I couldn't tell whether the code was doing anything at all.

plouf[, (cols) := lapply(.SD, function(x, i) 10*mean(x[i]), 
                         get("x") %between% c(1, 3)), 
      .SDcols = cols ,by = z][]

#     x  y z
# 1: 10 20 3
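
The "not re-evaluated" point follows from how R passes extra arguments through lapply: they are forwarded as a lazily evaluated promise that is forced at most once. A small base R sketch of this, where make_mask and counter are hypothetical names used only for illustration:

```r
counter <- 0

# Hypothetical helper: builds the logical mask and counts how often it runs.
make_mask <- function() {
  counter <<- counter + 1
  c(TRUE, FALSE, TRUE)
}

# The mask expression is the third argument to lapply(), so it reaches
# every call of the anonymous function as the same promise.
res <- lapply(list(a = 1:3, b = 4:6), function(x, i) sum(x[i]), make_mask())

counter  # 1: the mask was built once, although the function ran twice
```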

There are other workarounds that would allow you to subset .SD, but I think subsetting .SD by group is slower than subsetting each column individually.

library(data.table)
library(magrittr)  # provides %>%

set.seed(0)
df <- rep(1:50000, sample(500:1000, 50000, TRUE)) %>% 
        data.table(a = runif(length(.))
                  ,b = .)

library(microbenchmark)
microbenchmark(
  subSD = df[, lapply(.SD[a < .2], sum), b]
  , in_func = df[, lapply(.SD, function(x, i) sum(x[i]), a < .2), b]
  , times = 10L)

# Unit: milliseconds
#     expr      min         lq      mean     median        uq       max neval cld
#    subSD 19323.19 20398.3666 21289.345 20708.4346 22466.010 23738.467    10   b
#  in_func   972.64   987.7891  1016.252   995.4236  1038.069  1125.709    10  a 
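
Before trusting the timings, it may be worth checking on a small table that the two expressions return the same result. This is a minimal sketch; the table small and its size are assumptions chosen only to keep the check quick.

```r
library(data.table)

set.seed(0)
small <- data.table(a = runif(20), b = rep(1:4, each = 5))

# Subset .SD per group, then sum the remaining rows.
sub_sd <- small[, lapply(.SD[a < .2], sum), b]

# Subset each column inside the function instead.
in_func <- small[, lapply(.SD, function(x, i) sum(x[i]), a < .2), b]

all.equal(sub_sd, in_func)
```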

Edit: bigger benchmark

set.seed(0)
rm(df)
df <- rep(1:5e5, sample(50:100, 5e5, TRUE)) %>% 
        data.table(a = runif(length(.))
                  ,b = .)

library(microbenchmark)
microbenchmark(
  subSD = df[, lapply(.SD[a < .2], sum), b]
  , in_func = df[, lapply(.SD, function(x, i) sum(x[i]), a < .2), b]
  , times = 2L)

# Unit: seconds
#     expr        min         lq       mean     median        uq       max neval cld
#    subSD 207.111290 207.111290 214.147649 214.147649 221.18401 221.18401     2   b
#  in_func   3.560467   3.560467   3.651359   3.651359   3.74225   3.74225     2  a 

In the bug report on GitHub, @jangorecki suggested:


As a workaround, you can for now use substitute rather than get:

var = "x"
expr = substitute(
  plouf[, c(cols) := lapply(.SD[.var == 1],function(x){x}), .SDcols = cols, by = z][],
  list(.var=as.name(var))
)
print(expr)
#plouf[, `:=`(c(cols), lapply(.SD[x == 1], function(x) {
#    x
#})), .SDcols = cols, by = z][]
eval(expr)
#   x y z
#1: 2 1 3

Personally, I would use it regularly, not only as a workaround; I find R's metaprogramming features superior. Also be aware that some day, instead of get(var), we should be able to use ..var (see #2816, #3199). R metaprogramming has always worked and, I assume, always will, thanks to R's conservative, backward-compatible development.
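
The same substitute approach can be applied to the range filter from the question. This is a sketch under the assumption that the column name arrives as a string in var; .var is just a placeholder symbol, and res is an illustrative name.

```r
library(data.table)

plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y", "x")
var <- "x"

# Replace the placeholder .var with the symbol named by var, so the call
# that is finally evaluated contains x >= 1 & x <= 3 and no get().
expr <- substitute(
  plouf[, lapply(.SD[.var >= 1 & .var <= 3], mean), .SDcols = cols, by = z],
  list(.var = as.name(var))
)
res <- eval(expr)
```

Since the evaluated call no longer contains get, the columns of res should come back in the order given by cols, as in the working examples from the question.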
