I found a strange behavior of data.table. I would like to know if there is a way to avoid it, or a workaround.
In my data management, I often use lapply with .SD to assign new values to columns. To assign several columns correctly, the order of the columns in the lapply output must be preserved. I found a situation where it is not.
Here is the normal behavior:
library(data.table)
plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y","x")
plouf[,.SD,.SDcols = cols ,by = z]
plouf[,lapply(.SD,function(x){x}),.SDcols = cols ,by = z]
plouf[,lapply(.SD[x == 1],function(x){x}),.SDcols = cols ,by = z]
All these lines give :
z y x
1: 3 2 1
This is the order I need, for example, to reassign to c("y","x"). But if I do:
plouf[,lapply(.SD[get("x") == 1],function(x){x}),.SDcols = c("y","x"),by = z]
z x y
1: 3 1 2
Here the order of x and y changed for no apparent reason, when it should yield the same result as the last "working" example. As a result, the wrong values are assigned to c("y","x") if I assign the output of lapply to a vector of columns. It seems that the use of get in the i part of .SD triggers this bug.
Example of the effect of this on assignment:
plouf[, c(cols ) := lapply(.SD[get("x") == 1],function(x){x}),
.SDcols = cols ,by = z][]
# x y z
# 1: 2 1 3
Does anyone have a workaround? The code I am actually using looks more like:
plouf[, c(cols ) := lapply(.SD[get("x") >= 1 & get("x") <= 3],function(x){mean(x)}),
.SDcols = cols ,by = z]
The issue on GitHub: https://github.com/Rdatatable/data.table/issues/4089
Instead of subsetting .SD, you could do the subsetting inside your lapply function. If the logical vector used for subsetting is passed as a third argument to lapply, it isn't re-evaluated on each lapply pass.
Note: I changed the function to multiply by 10, since otherwise I couldn't tell whether the code was doing anything at all.
plouf[, (cols) := lapply(.SD, function(x, i) 10*mean(x[i]),
get("x") %between% c(1, 3)),
.SDcols = cols ,by = z][]
# x y z
# 1: 10 20 3
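To illustrate why the mask is not recomputed for every column, here is a minimal base R sketch. The names make_mask and counter are hypothetical and only count evaluations. Extra arguments given to lapply are forced once per lapply call and then reused for each element; inside a grouped data.table call that means once per group.

```r
# Hypothetical counter: records how many times the mask expression is evaluated
counter <- 0
make_mask <- function() {
  counter <<- counter + 1
  c(TRUE, FALSE, TRUE)
}

# The mask is passed as an extra argument to lapply and reused for each element
res <- lapply(list(a = 1:3, b = 4:6), function(x, i) sum(x[i]), make_mask())

counter  # number of times make_mask() actually ran
res      # per-element sums over the masked positions
```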
There are other workarounds that would allow you to subset .SD, but I think subsetting .SD by group is slower than subsetting each column individually.
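library(magrittr)  # provides the %>% pipe used below
set.seed(0)
# b: 50,000 group ids, each repeated 500 to 1000 times; a: uniform random values
df <- rep(1:50000, sample(500:1000, 50000, TRUE)) %>%
data.table(a = runif(length(.))
,b = .)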
library(microbenchmark)
microbenchmark(
subSD = df[, lapply(.SD[a < .2], sum), b]
, in_func = df[, lapply(.SD, function(x, i) sum(x[i]), a < .2), b]
, times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# subSD 19323.19 20398.3666 21289.345 20708.4346 22466.010 23738.467 10 b
# in_func 972.64 987.7891 1016.252 995.4236 1038.069 1125.709 10 a
Edit: bigger benchmark
set.seed(0)
rm(df)
df <- rep(1:5e5, sample(50:100, 5e5, T)) %>%
data.table(a = runif(length(.))
,b = .)
library(microbenchmark)
microbenchmark(
subSD = df[, lapply(.SD[a < .2], sum), b]
, in_func = df[, lapply(.SD, function(x, i) sum(x[i]), a < .2), b]
, times = 2L)
# Unit: seconds
# expr min lq mean median uq max neval cld
# subSD 207.111290 207.111290 214.147649 214.147649 221.18401 221.18401 2 b
# in_func 3.560467 3.560467 3.651359 3.651359 3.74225 3.74225 2 a
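If you would rather keep subsetting .SD despite the cost, one possible safeguard is to reorder the lapply result by name before assigning, so the assignment no longer depends on the order in which .SD presents its columns. This is a sketch using the question's toy data, not a fix for the underlying issue. The trailing [cols] is the only addition; it relies on lapply returning a list named after the columns.

```r
library(data.table)

plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y", "x")

# Index the named list returned by lapply with cols, so that each value
# lands in the column of the same name whatever order .SD comes in
plouf[, (cols) := lapply(.SD[get("x") == 1], function(v) v * 10)[cols],
      .SDcols = cols, by = z]

plouf[]
```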
In the bug report on GitHub, @jangorecki suggested:
As a workaround you can use now substitute rather than get
var = "x"
expr = substitute(
plouf[, c(cols) := lapply(.SD[.var == 1],function(x){x}), .SDcols = cols, by = z][],
list(.var=as.name(var))
)
print(expr)
#plouf[, `:=`(c(cols), lapply(.SD[x == 1], function(x) {
# x
#})), .SDcols = cols, by = z][]
eval(expr)
# x y z
#1: 2 1 3
Personally I would use it regularly, not as a workaround; I find R metaprogramming features superior. Also be aware that some day, instead of get(var), we should be able to use ..var, see (#2816, #3199). R metaprogramming always worked and, I assume, will always work, thanks to the conservative, backward-compatible development of R.
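The substitute approach can also be wrapped in a small function when the column name arrives as a string. The sketch below assumes the range filter from the question (values between lo and hi) and grouping by z. The helper name mean_between and its arguments are hypothetical, not part of data.table.

```r
library(data.table)

# Hypothetical helper: per-group mean of cols over the rows where the column
# named by var lies in [lo, hi]; the name is spliced in with substitute()
mean_between <- function(dt, var, lo, hi, cols) {
  expr <- substitute(
    dt[, lapply(.SD[.var >= lo & .var <= hi], mean), .SDcols = cols, by = z],
    list(.var = as.name(var))
  )
  eval(expr)  # evaluated in the function's frame, where dt, lo, hi, cols live
}

plouf <- data.table(x = 1, y = 2, z = 3)
res <- mean_between(plouf, "x", 1, 3, c("y", "x"))
res
```

Because the expression that data.table sees contains a plain column symbol rather than get(), it is equivalent to writing the column name directly.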