
data.table - apply vector of values

I'm somewhat bogged down by this question. I have a data.table of beta distribution parameters; each row also carries the relative probability that its distribution represents the actual outcome.

I want to compute the cumulative distribution function for a number of sample values. Using sapply, the code looks like this:

# each row: beta shape parameters (a, b), a scale, and a mixture weight probs
beta_dists <- data.table(probs = c(0.4,0.3,0.3), a = c(0.0011952,0.001,0.00809), b = c(837,220,624), scale = c(1.5e9,115e6,1.5e6))
xx <- seq(0, 1.5e9, length = 2^12)

system.time(FX <- sapply(xx, function(x) (beta_dists[x < scale, .(FX = sum(probs * (1 - pbeta(x / scale, a, b))))])$FX))

However, that's quite slow and does not seem very elegant... Any thoughts on how to make this better?
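For reference, the quantity computed for each sample value x is a probability-weighted mixture of beta survival functions over the rows whose scale exceeds x. Below is a minimal base-R sketch for a single value, equivalent to the j expression above (fx_one is a hypothetical helper name, not part of the original code):

# mixture value at a single x, using plain vectors instead of data.table;
# it returns 0 when no row satisfies x < scale, whereas the sapply version
# above yields an empty element there
fx_one <- function(x, d = beta_dists) {
  keep <- x < d$scale
  sum(d$probs[keep] * (1 - pbeta(x / d$scale[keep], d$a[keep], d$b[keep])))
}
fx_one(xx[2])  # should agree with FX[[2]]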

Here is a suggestion to use a non-equi join, by converting your xx into a data.table to be used in i:

ans <- beta_dists[dtx, on=.(scale > x), allow.cartesian=TRUE,
    sum(probs * (1 - pbeta(x / x.scale, a, b))), by=.EACHI]$V1
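To see what the non-equi join pairs up before the by=.EACHI aggregation, the join can be run on its own (a quick inspection sketch; note that in the joined result the column named scale holds the x values taken from dtx, which is why the j expression above uses x.scale to reach beta_dists' original scale column):

# each x from dtx is matched to every row of beta_dists whose scale exceeds it
joined <- beta_dists[dtx, on = .(scale > x), allow.cartesian = TRUE]
head(joined)  # columns probs, a, b and scale, where scale holds dtx's x values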

check:

# last element is NA in ans whereas it's NULL in FX
identical(unlist(FX), head(ans, -1))
#[1] TRUE

timing code:

opmtd <- function() {
    sapply(xx, function(x) (beta_dists[x < scale,.(FX = sum(probs * (1 - pbeta(x / scale, a, b))))])$FX)
}

nonequiMtd <- function() {
    beta_dists[dtx, on=.(scale > x), allow.cartesian=TRUE, sum(probs * (1 - pbeta(x / x.scale, a, b))), by=.EACHI]   
}

vapplyMtd <- function() {
    # dt and f are defined in the vapply answer below
    dt[, res := vapply(x, f, 0)]
}

library(microbenchmark)
microbenchmark(opmtd(), nonequiMtd(), vapplyMtd(), times=3L)

timings:

Unit: milliseconds
         expr        min         lq       mean     median         uq        max neval
      opmtd() 2589.67889 2606.77795 2643.77975 2623.87700 2670.83018 2717.78336     3
 nonequiMtd()   19.59376   21.12739   22.28428   22.66102   23.62954   24.59805     3
  vapplyMtd() 1928.25841 1939.91866 1960.31181 1951.57891 1976.33852 2001.09812     3

data:

library(data.table)
beta_dists <- data.table(probs = c(0.4,0.3,0.3), a = c(0.0011952,0.001,0.00809), b = c(837,220,624), scale = c(1.5e9,115e6,1.5e6))
xx <- seq(0, 1.5e9, length = 2^12)
dtx <- data.table(x=xx)

My only idea is to do it the other way, that is to zip through a data table that contains your sample values:

dt <- data.table(x = xx, res = 0)
# mixture value for a single sample x, restricted to rows with scale > x
f <- function(x) {
  beta_dists[x < scale, sum(probs * (1 - pbeta(x / scale, a, b)))]
}
system.time(dt[, res := vapply(x, f, 0)])
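A quick consistency check against the sapply result FX could look like the line below (hedged: it assumes f() returns 0 for the final grid point 1.5e9, where no row satisfies x < scale, while the sapply version produces an empty element that unlist() drops):

# compare all but the last element, for the reason noted above
all.equal(unlist(FX), head(dt$res, -1))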

It seems to be slightly faster. For instance, when I increased your sample size to 2^14, your original code ran on my machine for 7 seconds, but my proposed code did it in 5 seconds.

I think the slowest part is the pbeta() function, but I could be wrong.
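One rough way to test that hunch is to time the pbeta calls in isolation, one evaluation per (x, distribution) pair, and compare against the timings above. This is only a sketch: it evaluates every combination, including those the x < scale filter would skip, so it slightly overstates the pbeta share.

# 4096 x 3 matrix of x / scale, plus matching shape parameters in column-major order
qq <- outer(xx, beta_dists$scale, `/`)
aa <- rep(beta_dists$a, each = length(xx))
bb <- rep(beta_dists$b, each = length(xx))
system.time(p <- pbeta(qq, aa, bb))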
