R - Speeding up loop over array dimensions

I am working with an array with dimensions:

[1] 290 259  55   4

For each combination of the last three dimensions, I want to take a rolling mean (window of 2) over the 290 elements of the first dimension, which reduces the number of elements to 289. Finally, I need to create a data frame with the updated values.
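To make the target operation concrete, here is a minimal base R sketch on a made-up four-element vector (not the real data). A width-2 rolling mean averages each element with its neighbour, so it returns one fewer value than it was given:

```r
# Made-up example vector; the real first dimension has 290 elements.
x <- c(2, 4, 8, 16)

# Average each element with the one before it (width-2 rolling mean).
roll2 <- (x[-1] + x[-length(x)]) / 2

roll2                            # 3 6 12
length(roll2) == length(x) - 1   # TRUE
```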

The following code does what I need, but it takes a very long time to run (in fact, I have to interrupt it before it finishes).

library(zoo)

# Generate example data with the same dimensions as mine
my.array <- array(1:16524200, dim=c(290,259,55,4))

# Get dimension sizes
dim2 <- dim(my.array)[2]
dim3 <- dim(my.array)[3]
dim4 <- dim(my.array)[4]

# Start from an empty data frame that is grown inside the loop
df2 <- data.frame()

# Loop over dimensions
for (i in 1:dim4) {
  for (j in 1:dim3) {
    for (k in 1:dim2) {

      # Take rolling average
      u <- rollapply(my.array[,k,j,i], 2, mean)

      # Assemble data frame
      df1 <- data.frame(time=i, level=j, lat=k, wind=u)
      df2 <- rbind(df2, df1)

    }
  }
}
# Very slow, and uses only one machine core

I feel like it should be possible to improve the processing time of this code by using vectorization or even some kind of parallelism, but I can't figure out how.

Any suggestions to make this code more efficient?

apply() works on any number of dimensions, so you can get the same result much more quickly with the following. Wrapping the call in as.data.frame.table() efficiently converts the output from an array to a data frame:

library(zoo)
df <- as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))

Not strictly necessary, but this can be tidied up to match your original output:

idx <- sapply(df, is.factor)
df[idx] <- sapply(df[idx], as.integer)

df <- setNames(df[c(4,3,2,5)], c("time", "level", "lat", "wind"))

Check that the result is the same:

identical(df2, df)
[1] TRUE
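To see how the apply() result is laid out, here is a small sketch on a made-up 4 x 3 x 2 x 2 array. The helper roll2 is a hypothetical base R stand-in for zoo::rollmean(x, 2), used only so the example runs without extra packages:

```r
# Small stand-in array (sizes chosen for illustration only).
small <- array(as.numeric(1:48), dim = c(4, 3, 2, 2))

# Hypothetical stand-in for zoo::rollmean(x, 2): mean of each adjacent pair.
roll2 <- function(x) (x[-1] + x[-length(x)]) / 2

# apply() over margins 2, 3, 4 passes each first-dimension vector to roll2()
# and stacks the returned vectors along a new first dimension.
res <- apply(small, c(2, 3, 4), roll2)
dim(res)   # 3 3 2 2

# as.data.frame.table() unrolls the array in column-major order:
# Var1 varies fastest, Var4 slowest, and the values land in Freq.
long <- as.data.frame.table(res)
names(long)
head(long, 4)
```

With this layout, Var4, Var3 and Var2 correspond to time, level and lat, and Freq holds the rolling means, which is why the tidy-up step selects columns c(4,3,2,5).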

Up front, you are suffering from the 2nd circle of R's Inferno (https://www.burns-stat.com/pages/Tutor/R_inferno.pdf): growing objects. Each time you call rbind, it makes a complete copy of the frame, does the row-binding, then writes that complete copy back over the original variable name. So while it might work without noticeable slow-down for the first few dozen iterations, it will slow down once you get past 100 or so ... and you're doing it 56,980 times.

It is generally much better to process things into a list and then do the rbind once at the end on the entire list, as in do.call(rbind, list_of_frames). Granted, you may still have the computational challenge of doing something potentially hard ... luckily zoo is about as efficient as you can get for window operations, and this one is not impossibly hard.
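As a minimal sketch of the two patterns side by side (make_frame and the size n are made up for illustration), both approaches build the same rows, but only the first one re-copies the accumulated frame on every iteration:

```r
# Hypothetical helper that builds one small frame per index.
make_frame <- function(i) data.frame(id = i, value = i^2)

n <- 200  # small illustrative size

# Anti-pattern: grow the data frame inside the loop (copies `grown` each time).
grown <- data.frame()
for (i in seq_len(n)) grown <- rbind(grown, make_frame(i))

# Preferred: collect the frames in a list, then bind once at the end.
frames <- lapply(seq_len(n), make_frame)
bound <- do.call(rbind, frames)

# Both approaches hold the same values.
identical(grown$value, bound$value)
```

Wrapping each half in system.time() with a larger n should show the gap widening as n grows, though the exact numbers depend on the machine.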

I'll demonstrate on a significantly reduced problem set (I don't think it matters for the demonstration whether we're looking at 16M or 1.5M values).

my.array <- array(1:1502200, dim=c(290,259,5,4))
eg <- do.call(expand.grid, lapply(dim(my.array)[-1], seq_len))
dim(eg)
# [1] 5180    3
head(eg)
#   Var1 Var2 Var3
# 1    1    1    1
# 2    2    1    1
# 3    3    1    1
# 4    4    1    1
# 5    5    1    1
# 6    6    1    1

system.time({
  list_of_frames <- Map(function(i,j,k) {
    u <- zoo::rollapply(my.array[,i,j,k], 2, mean)
    data.frame(i, j, k, wind = u)
  }, eg[[1]], eg[[2]], eg[[3]])
})
#    user  system elapsed 
#    5.79    0.00    5.80 
head(list_of_frames[[5]])
#   i j k   wind
# 1 5 1 1 1161.5
# 2 5 1 1 1162.5
# 3 5 1 1 1163.5
# 4 5 1 1 1164.5
# 5 5 1 1 1165.5
# 6 5 1 1 1166.5

system.time({
  out <- do.call(rbind, list_of_frames)
})
#    user  system elapsed 
#    0.50    0.03    0.53 
nrow(out)
# [1] 1497020
rbind(head(out), tail(out))
#           i j k      wind
# 1         1 1 1       1.5
# 2         1 1 1       2.5
# 3         1 1 1       3.5
# 4         1 1 1       4.5
# 5         1 1 1       5.5
# 6         1 1 1       6.5
# 1497015 259 5 4 1502194.5
# 1497016 259 5 4 1502195.5
# 1497017 259 5 4 1502196.5
# 1497018 259 5 4 1502197.5
# 1497019 259 5 4 1502198.5
# 1497020 259 5 4 1502199.5

Explanation:

  • do.call(expand.grid, ...) creates a frame of all the i,j,k combinations you need, built dynamically from the dimensions of your array.
  • Map(f, is, js, ks) runs the function f on the 1st element of each of is, js, and ks, then on the 2nd element of each, and so on (the names are notional, for this bullet only), so the Map call behaves something like:

     f(is[1], js[1], ks[1])
     f(is[2], js[2], ks[2])
     f(is[3], js[3], ks[3])
     # ...
  • Then we combine them in one call using do.call(rbind, ...). We really have to use do.call here because this call is analogous to

    rbind(list_of_frames[[1]], list_of_frames[[2]], ..., list_of_frames[[5180]])

    (over to you if you'd prefer to write out this version).
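The same three steps can be sketched on a toy 2 x 3 index space (the sizes and the product column are made up for illustration):

```r
# All combinations for a hypothetical 2 x 3 index space; i varies fastest.
eg <- expand.grid(i = 1:2, j = 1:3)

# Map() walks the argument vectors in parallel: f(i[1], j[1]), f(i[2], j[2]), ...
pieces <- Map(function(i, j) data.frame(i = i, j = j, prod = i * j), eg$i, eg$j)
length(pieces)   # 6, one element per row of eg

# do.call() passes the list elements to rbind() as separate arguments.
out <- do.call(rbind, pieces)
out
```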

Another option is to flatten the multidimensional array first and then use data.table to calculate the rolling mean:

library(data.table)
system.time({
    ans <- setDT(as.data.frame.table(my.array))[
        , .(wind=((Freq + shift(Freq)) / 2)[-1L]), 
        .(time=Var4, level=Var3, lat=Var2)]
    cols <- c("time", "level", "lat")
    ans[, (cols) := lapply(.SD, function(x) match(x, unique(x))), .SDcols=cols]
})
ans

output:

          time level lat       wind
       1:    1     1   1        1.5
       2:    1     1   1        2.5
       3:    1     1   1        3.5
       4:    1     1   1        4.5
       5:    1     1   1        5.5
      ---                          
16467216:    4    55 259 16524195.5
16467217:    4    55 259 16524196.5
16467218:    4    55 259 16524197.5
16467219:    4    55 259 16524198.5
16467220:    4    55 259 16524199.5

timings:

   user  system elapsed 
   4.90    1.16    5.66 

and for comparison:

library(zoo)
system.time({
    as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))  
})
#   user  system elapsed 
#  21.89    0.63   22.51 
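The flatten-then-lag idea behind the data.table version can also be sketched in base R on a small made-up array. This is not the data.table code itself: ave() stands in for the grouped computation and a manually shifted vector stands in for shift(), so it only illustrates the logic and makes no claim about speed:

```r
# Small stand-in array: 4 steps along the first dimension, 3 x 2 groups.
small <- array(as.numeric(1:24), dim = c(4, 3, 2))

# Flatten to long format: Var1 is the position along the first dimension,
# Var2 and Var3 identify the group, Freq holds the values.
flat <- as.data.frame.table(small)

# Within each (Var2, Var3) group, average each value with its predecessor.
# The first value of every group has no predecessor and becomes NA.
rolled <- ave(flat$Freq, flat$Var2, flat$Var3,
              FUN = function(x) (x + c(NA, x[-length(x)])) / 2)

# Drop the NA rows, leaving 3 rolling means per group.
keep <- !is.na(rolled)
out <- data.frame(flat[keep, c("Var3", "Var2")], wind = rolled[keep])
nrow(out)   # 18
head(out, 4)
```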
