[英]Efficient and fast application of a function to 3D arrays in R
I have a very large 3D array (say 100 x 100 x 10) that I would like to apply a function over for pairwise comparisons.我有一个非常大的 3D 数组(比如 100 x 100 x 10),我想应用 function 进行成对比较。 I've tried a number of solutions, using data.table, mapply, etc. I'm maybe naively hoping for faster speedups, and am considering just doing this with C++/Rcpp.我已经尝试了很多解决方案,使用 data.table、mapply 等。我可能天真地希望更快的加速,并且正在考虑只用 C++/Rcpp 来做这件事。 But before doing that, I thought I'd see if anyone is aware of a more elegant / faster solution to this problem?但在这样做之前,我想我会看看是否有人知道这个问题的更优雅/更快的解决方案? Many thanks!非常感谢!
Example code in R. For this smaller dimension version of what I'm wanting to apply this to, mapply() is a little faster than data.table R 中的示例代码。对于我想要应用它的这个较小尺寸版本,mapply() 比 data.table 快一点
m <- 20
n <- 10 # number of data points per row/col combination
R <- array(runif(n*m*m), dim=c(m,m,n)) # 3D array to apply function over
grid <- expand.grid(A = 1:m, B = 1:m, C = 1:m, D = 1:m) # array indices (used as args below)
#function to do basic correlations between R[1,2,] and R[1,10,]
ss2 <- function(a,b,c,d) {
rho = cor(R[a, b, ], R[c, d, ])
}
#solution with data.table
dt <- setDT(grid) # convert from df -> dt
sol_1 <- dt[, ss2(A, B,C,D), by = seq_len(nrow(dt))]
#solution with mapply
sol_2 <- mapply(ss2, grid$A, grid$B, grid$C, grid$D)
I tried this with mapply(), data.table().我用 mapply()、data.table() 试过了。 I've also tried using a parellelized version of apply() (parApply, https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html )我也尝试过使用 apply() 的并行化版本(parApply, https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html )
UPDATE: cora
from the Rfast
package gives further performance improvements.更新:来自cora
Rfast
的 cora 提供了进一步的性能改进。
By reshaping the array, we can use cor
directly for a ~2K times speedup:通过重塑数组,我们可以直接使用cor
进行约 2K 倍的加速:
library(data.table)
library(Rfast)
m <- 20
n <- 10 # number of data points per row/col combination
R <- array(runif(n*m*m), dim=c(m,m,n)) # 3D array to apply function over
grid <- expand.grid(A = 1:m, B = 1:m, C = 1:m, D = 1:m)
ss2 <- function(a,b,c,d) rho = cor(R[a, b, ], R[c, d, ])
dt <- setDT(grid)
microbenchmark::microbenchmark(
sol_1 = dt[, ss2(A, B, C, D), by = seq_len(nrow(dt))][[2]],
sol_2 = mapply(ss2, grid$A, grid$B, grid$C, grid$D),
sol_3 = c(cor(t(matrix(R, m*m, n)))),
sol_4 = c(cora(t(matrix(R, m*m, n)))),
check = "equal",
times = 10
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> sol_1 2101327.2 2135311.0 2186922.33 2178526.6 2247049.6 2301429.5 10
#> sol_2 2255828.9 2266427.5 2306180.23 2287911.0 2321609.6 2471711.7 10
#> sol_3 1203.8 1222.2 1244.75 1236.1 1243.9 1343.5 10
#> sol_4 922.6 945.8 952.68 951.9 955.8 988.8 10
Timing the full 100 x 100 x 10 array:对完整的 100 x 100 x 10 阵列进行计时:
m <- 100L
n <- 10L
R <- array(runif(n*m*m), dim=c(m,m,n))
microbenchmark::microbenchmark(
sol_3 = c(cor(t(matrix(R, m*m, n)))),
sol_4 = c(cora(t(matrix(R, m*m, n)))),
check = "equal",
times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> sol_3 1293.0739 1298.4997 1466.546 1503.453 1513.746 1902.802 10
#> sol_4 879.8659 892.2699 1058.064 1055.668 1143.767 1300.282 10
Note that filling by column then transposing tends to be slightly faster than filling by row in this case .请注意, 在这种情况下,按列填充然后转置往往比按行填充稍快。 Also note that ss2
and grid
are no longer needed.另请注意,不再需要ss2
和grid
。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.