How to use Reduce() function in R parallel computing?

I want to run Reduce over out1, a list of 66000 elements:

trialStep1_done <- Reduce(rbind, out1)

However, it takes too long to run. I wonder whether I can speed this up with the help of a parallel computing package.

I know there are mclapply and mcMap, but I don't see any function like mcReduce in the parallel package.

Is there a function like mcReduce that runs Reduce in parallel in R, so I can complete this task?

Thanks a lot @BrodieG and @ZheyuanLi, your answers are very helpful. I think the following code example represents my question more precisely:

library(magrittr)  # provides the %>% pipe used below

# Three data frames with the same columns; df2 and df3 are shuffled copies
df1 <- data.frame(a=letters, b=LETTERS, c=1:26 %>% as.character())
set.seed(123)
df2 <- data.frame(a=letters %>% sample(), b=LETTERS %>% sample(), c=1:26 %>% sample() %>% as.character())
set.seed(1234)
df3 <- data.frame(a=letters %>% sample(), b=LETTERS %>% sample(), c=1:26 %>% sample() %>% as.character())
out1 <- list(df1, df2, df3)

# I don't know how to rbind() the list elements using only matrix(),
# so I use lapply() followed by Reduce() or do.call()
out2 <- lapply(out1, function(x) matrix(unlist(x), ncol = length(x), byrow = FALSE))

Reduce(rbind, out2)
do.call(rbind, out2)
# One thing is certain: `do.call()` is much faster than `Reduce()`; @BrodieG's answer helped me understand why.

So, at this point, do.call() solves the problem very well for my 200000-row dataset.

Finally, I wonder whether there is an even faster way. Could the approach @ZheyuanLi demonstrated, using just matrix(), work here?

The problem is not rbind, the problem is Reduce. Unfortunately, function calls in R are expensive, particularly when you keep creating new objects. In this case, you call rbind 65999 times, and each call creates a new R object with one more row. Instead, you can call rbind just once with 66000 arguments, which is much faster: internally rbind does the binding in C, so R functions are not called 66000 times and the memory is allocated just once. Here we compare your Reduce use with Zheyuan's matrix/unlist approach, and finally with rbind called once via do.call (do.call lets you call a function with all of its arguments supplied as a list):

out1 <- replicate(1000, 1:20, simplify=FALSE)  # use 1000 elements for illustrative purposes

library(microbenchmark)    
microbenchmark(times=10,
  a <- do.call(rbind, out1),
  b <- matrix(unlist(out1), ncol=20, byrow=TRUE),
  c <- Reduce(rbind, out1)
)
# Unit: microseconds
#                                                expr        min         lq
#                           a <- do.call(rbind, out1)    469.873    479.815
#  b <- matrix(unlist(out1), ncol = 20, byrow = TRUE)    257.263    260.479
#                            c <- Reduce(rbind, out1) 110764.898 113976.376
all.equal(a, b, check.attributes=FALSE)
# [1] TRUE
all.equal(b, c, check.attributes=FALSE)
# [1] TRUE

Zheyuan's approach is the fastest, but for all intents and purposes the do.call(rbind, out1) method is pretty similar.
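To make the gap in the benchmark more concrete, here is a rough back-of-the-envelope sketch for the 66000-element case. It assumes a simplified cost model in which every rbind call writes every row of its result; that model is only an illustration, not a measurement of what R does internally.

```r
# Simplified cost model (an assumption for illustration): each rbind call
# writes every row of the matrix it returns.
n <- 66000

# Reduce(rbind, out1) calls rbind once per element after the first.
calls_reduce <- n - 1

# Those calls return matrices with 2, 3, ..., n rows, so the total number of
# rows written is 2 + 3 + ... + n. as.numeric() avoids integer overflow.
rows_written_reduce <- sum(as.numeric(2:n))

# do.call(rbind, out1) builds the final n-row matrix in a single call.
rows_written_once <- n

cat("rbind calls with Reduce:  ", calls_reduce, "\n")
cat("rows written with Reduce: ", format(rows_written_reduce, big.mark = ","), "\n")
cat("rows written with do.call:", format(rows_written_once, big.mark = ","), "\n")
cat("ratio:                    ", round(rows_written_reduce / rows_written_once), "\n")
```

Under this model the work done by Reduce grows quadratically with the list length, while the single-call version grows linearly, which is why the difference becomes more pronounced for longer lists.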

  1. It is slow because you repeatedly call rbind. Every time it is called, new memory has to be allocated because the object's dimensions keep growing.
  2. Your work is memory-bound, so you are not going to benefit from parallelism. On a multi-core machine, parallel processing is only useful for CPU-bound tasks.

If I understand you correctly, you should probably use this:

trialStep1_done <- matrix(unlist(out1), nrow = length(out1), byrow = TRUE)

Example:

out1 <- list(1:4, 11:14, 21:24, 31:34)

#> str(out1)
#List of 4
# $ : int [1:4] 1 2 3 4
# $ : int [1:4] 11 12 13 14
# $ : int [1:4] 21 22 23 24
# $ : int [1:4] 31 32 33 34

trialStep1_done <- matrix(unlist(out1), nrow = length(out1), byrow = TRUE)

#> trialStep1_done
#     [,1] [,2] [,3] [,4]
#[1,]    1    2    3    4
#[2,]   11   12   13   14
#[3,]   21   22   23   24
#[4,]   31   32   33   34
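For readers still curious what a parallel reduce would look like, the sketch below binds chunks of the list in parallel and then binds the partial results. The chunk size, the core count, and the variable names are assumptions chosen for illustration; there is no mcReduce in the parallel package, and given the point above about memory-bound work, this is not expected to beat a plain do.call(rbind, out1).

```r
library(parallel)

out1 <- replicate(1000, 1:20, simplify = FALSE)

# mclapply() relies on forking, which is not available on Windows,
# so fall back to a single core there. The value 2 is an arbitrary choice.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split the list into consecutive chunks of 100 elements (arbitrary size).
chunk_size <- 100
chunk_id <- ceiling(seq_along(out1) / chunk_size)
chunks <- split(out1, chunk_id)

# Bind each chunk with a single rbind call, one chunk per task.
parts <- mclapply(chunks, function(chunk) do.call(rbind, chunk),
                  mc.cores = n_cores)

# Bind the partial matrices; unname() keeps the chunk labels out of the call.
combined <- do.call(rbind, unname(parts))

# Same values as the single-call version.
reference <- do.call(rbind, out1)
stopifnot(isTRUE(all.equal(combined, reference, check.attributes = FALSE)))
```

Each chunk still uses one rbind call, so the sketch keeps the benefit of avoiding repeated growth; the parallel layer only adds the cost of forking and of copying the partial results back to the parent process.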

Thanks to @BrodieG for the excellent explanation and benchmarking result!

I ran the benchmark on my laptop as well, using exactly the same code as @BrodieG, and this is what I got:

Unit: microseconds
                                               expr      min       lq      mean   median       uq       max neval
                          a <- do.call(rbind, out1)   653.60   670.36   900.120   745.54   832.36   2352.28    10
 b <- matrix(unlist(out1), ncol = 20, byrow = TRUE)   170.16   177.60   224.036   183.98   286.84    385.96    10
                           c <- Reduce(rbind, out1) 65589.48 67519.32 72317.812 68897.36 69372.88 108135.96    10
