Please consider this
library(data.table)
mydt <-
data.table(id = 1:100,
p1 = sample(seq(0,1,length.out=1000),100))
mydt$p2 <- 1 - mydt$p1
I want to apply a function using as the argument a vector from columns p1
and p2
.
myFun <- function(x) {
sample(c(1,2), 1, prob = x)
}
This works,
mydt$outcome <- apply(mydt[,2:3], 1, myFun)
but I have a 25M rows, so I reach the memory limit.
I tried this, but it doesn't work.
mydt[,mydt := mapply(myFun, p1, p2)]
prob
argument in sample
requires a vector. And to apply myFun
to each row, you can use by=1:nrow(mydt)
or by=1:mydt[,.N]
mydt[, chosen := myFun(c(p1, p2)), by=1:nrow(mydt)]
Hat-tip to @Roland for his usage of rbinom
. His vectorized version for this Bernoulli trial is much faster.
> system.time(mydt[, chosen := myFun(c(p1, p2)), by=1:nrow(mydt)])
user system elapsed
4.82 0.00 4.86
> system.time(mydt[, outcome2 := rbinom(.N, 1, p2) + 1])
user system elapsed
0.05 0.02 0.06
data used in timings:
library(data.table)
set.seed(0L)
m <- 1e6
mydt <- data.table(id = 1:m, p1 = runif(m))[, p2 := 1 - p1]
myFun <- function(x) sample(c(1,2), 1, prob = x)
accuracy check:
n <- 0L
while (n < 1e3) {
set.seed(n)
mydt[, chosen := myFun(c(p1, p2)), by=1:nrow(mydt)]
set.seed(n)
mydt[, outcome2 := rbinom(.N, 1, p2) + 1]
if(!all.equal(mydt$chosen, mydt$outcome2)) stop("mismatch")
n <- n + 1
}
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.