Efficient and fast application of a function to 3D arrays in R

I have a very large 3D array (say 100 x 100 x 10) and I would like to apply a function over it for pairwise comparisons. I've tried a number of solutions, using data.table, mapply, etc. I'm maybe naively hoping for bigger speedups, and am considering just doing this with C++/Rcpp. But before doing that, I thought I'd see if anyone is aware of a more elegant / faster solution to this problem? Many thanks!

Example code in R. For this smaller version of the problem I want to solve, mapply() is a little faster than data.table.

library(data.table) # needed for setDT() below

m <- 20 # number of rows and columns
n <- 10 # number of data points per row/col combination

R <- array(runif(n*m*m), dim=c(m,m,n)) # 3D array to apply function over
grid <- expand.grid(A = 1:m, B = 1:m, C = 1:m, D = 1:m) # array indices (used as args below)

# function to compute the correlation between two fibres, e.g. R[1,2,] and R[1,10,]
ss2 <- function(a,b,c,d) {
  cor(R[a, b, ], R[c, d, ])
}

#solution with data.table
dt <- setDT(grid) # convert from df -> dt
sol_1 <- dt[, ss2(A, B,C,D), by = seq_len(nrow(dt))]

#solution with mapply
sol_2 <- mapply(ss2, grid$A, grid$B, grid$C, grid$D)

I tried this with mapply() and data.table. I've also tried a parallelized version of apply() (parApply, https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html).

UPDATE: cora from the Rfast package gives further performance improvements.

By reshaping the array, we can use cor directly for a ~2K times speedup:


library(data.table) # needed for setDT()

m <- 20
n <- 10 # number of data points per row/col combination

R <- array(runif(n*m*m), dim=c(m,m,n)) # 3D array to apply function over
grid <- expand.grid(A = 1:m, B = 1:m, C = 1:m, D = 1:m)
ss2 <- function(a,b,c,d) cor(R[a, b, ], R[c, d, ])
dt <- setDT(grid)

microbenchmark::microbenchmark(
  sol_1 = dt[, ss2(A, B, C, D), by = seq_len(nrow(dt))][[2]],
  sol_2 = mapply(ss2, grid$A, grid$B, grid$C, grid$D),
  sol_3 = c(cor(t(matrix(R, m*m, n)))),
  sol_4 = c(Rfast::cora(t(matrix(R, m*m, n)))),
  check = "equal",
  times = 10
)
#> Unit: microseconds
#>   expr       min        lq       mean    median        uq       max neval
#>  sol_1 2101327.2 2135311.0 2186922.33 2178526.6 2247049.6 2301429.5    10
#>  sol_2 2255828.9 2266427.5 2306180.23 2287911.0 2321609.6 2471711.7    10
#>  sol_3    1203.8    1222.2    1244.75    1236.1    1243.9    1343.5    10
#>  sol_4     922.6     945.8     952.68     951.9     955.8     988.8    10
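
To make the link between the reshaped result and the original pairwise calls explicit, here is a minimal sketch on a small array. It assumes the usual column-major layout of R arrays, so that row a + (b - 1) * m of matrix(R, m*m, n) holds R[a, b, ]. The helper idx is a hypothetical name introduced only for this illustration, and the seed and sizes are arbitrary.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
m <- 4                            # small size chosen for illustration
n <- 10
R <- array(runif(n * m * m), dim = c(m, m, n))

# Each row of X is one fibre R[a, b, ]; rows are ordered with a varying fastest
X <- matrix(R, m * m, n)
C <- cor(t(X))                    # (m*m) x (m*m) matrix of all pairwise correlations

# Hypothetical helper: position of R[a, b, ] among the rows of X
idx <- function(a, b) a + (b - 1L) * m

# Compare one direct pairwise call against the matrix lookup
direct <- cor(R[1, 2, ], R[3, 4, ])
lookup <- C[idx(1, 2), idx(3, 4)]
stopifnot(isTRUE(all.equal(direct, lookup)))
```

Flattening C with c() then gives the values in the same order as the rows of expand.grid(A, B, C, D), which is why check = "equal" passes in the benchmark.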

Timing the full 100 x 100 x 10 array:

m <- 100L
n <- 10L
R <- array(runif(n*m*m), dim=c(m,m,n))

microbenchmark::microbenchmark(
  sol_3 = c(cor(t(matrix(R, m*m, n)))),
  sol_4 = c(Rfast::cora(t(matrix(R, m*m, n)))),
  check = "equal",
  times = 10
)
#> Unit: milliseconds
#>   expr       min        lq     mean   median       uq      max neval
#>  sol_3 1293.0739 1298.4997 1466.546 1503.453 1513.746 1902.802    10
#>  sol_4  879.8659  892.2699 1058.064 1055.668 1143.767 1300.282    10

Note that filling by column then transposing tends to be slightly faster than filling by row in this case. Also note that ss2 and grid are no longer needed.
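
For reference, a short sketch of the two constructions being compared. It assumes that "filling by row" means matrix(R, n, m*m, byrow = TRUE); that reading is an interpretation, not something stated above. The variable names by_col and by_row are introduced here for illustration, and no timing claim is made.

```r
m <- 5                            # small sizes chosen for illustration
n <- 10
R <- array(runif(n * m * m), dim = c(m, m, n))

# Fill an (m*m) x n matrix by column (the default), then transpose to n x (m*m)
by_col <- t(matrix(R, m * m, n))

# Assumed alternative: fill the n x (m*m) matrix by row directly
by_row <- matrix(R, n, m * m, byrow = TRUE)

# Both give the same n x (m*m) matrix, so cor() returns the same result for either
stopifnot(identical(by_col, by_row))
```

With m = 100 the correlation matrix has m^4 = 10^8 entries, which is about 800 MB as doubles, so memory rather than the reshape is likely to be the limiting factor at larger sizes.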

