How to write efficient nested functions for parallelization?
I have a dataframe with two grouping variables, class and group. For each class, I have a plotting task per group. Typically there are 2 class levels and 500 group levels.
I'm using the parallel package for parallelization and the mclapply function to iterate through the class and group levels.
I'm wondering which is the best way to write my iterations. I think I have two options:
1. Run the parallelization over the class variable.
2. Run the parallelization over the group variable.
My computer has 3 cores available to the R session, and I usually reserve the 4th core for my operating system. My concern was that if I parallelize over the class variable, which has only 2 levels, the 3rd core will never be used, so I thought it would be more efficient to parallelize over the group variable and keep all 3 cores busy. I've written some speed tests to check which is the best way:
library(microbenchmark)
library(parallel)

# A = cores for the outer loop over class levels,
# B = cores for the inner loop over group levels.
f = function(class, group, A, B) {
  mclapply(seq(class), mc.cores = A, function(z) {
    mclapply(seq(group), mc.cores = B, function(c) {
      # Pick the plot type from the current class level z
      ifelse(z == 1, 'plotA', 'plotB')
    })
  })
}

class = 2
group = 500

microbenchmark(
  up = f(class, group, 3, 1),
  nest = f(class, group, 1, 3),
  times = 50L
)
Unit: milliseconds
 expr       min        lq     mean    median       uq      max neval
   up  6.751193  7.897118 10.89985  9.769894 12.26880 26.87811    50
 nest 16.584382 18.999863 25.54437 22.293591 28.60268 63.49878    50
The result tells me that I should parallelize over the class variable and not over the group variable.
The takeaway would be that I should always write single-core functions and then call them in parallel. I think that this way my code would be simpler, or more reductionist, than writing nested functions with parallelization capabilities.
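To make the "single-core function, parallel call" idea concrete, here is a minimal sketch. The worker name plot_one and its returned label are hypothetical stand-ins for the real plotting code, and the core count is an assumption chosen so the snippet also runs on Windows, where mclapply cannot fork.

```r
library(parallel)

# Hypothetical single-core worker: handles exactly one (class, group) pair.
plot_one <- function(class_level, group_level) {
  # Placeholder for the real plotting code; returns a label instead of drawing.
  paste0(if (class_level == 1) "plotA" else "plotB", "_", group_level)
}

# mclapply relies on forking, which is unavailable on Windows (assumed fallback: 1 core).
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Parallelism is decided here, at the call site, not inside plot_one().
res <- mclapply(seq_len(500), function(g) plot_one(1, g), mc.cores = n_cores)
length(res)
```

Because plot_one knows nothing about cores, the same function can be called from lapply while debugging and from mclapply in production.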
The ifelse condition is there because the code that prepares the data for the plotting task is more or less the same for both class levels, so I thought it would take fewer lines to write one longer function that checks which class level is in use than to split it into two shorter functions.
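One possible shape for such a function is sketched below: the shared preparation runs once and only the last step branches on the class level. The names make_plot_label and toy, and the sort used as the "preparation", are illustrative assumptions rather than the code from the question.

```r
# Hypothetical helper: shared preparation runs once, then only the final
# step differs by class level.
make_plot_label <- function(data, class_level) {
  prepared <- data[order(data$x), , drop = FALSE]  # shared preparation step
  switch(as.character(class_level),
    "1" = paste("plotA with", nrow(prepared), "rows"),
    "2" = paste("plotB with", nrow(prepared), "rows"),
    stop("unknown class level: ", class_level)  # default branch for other levels
  )
}

toy <- data.frame(x = c(3, 1, 2))  # made-up data for illustration
make_plot_label(toy, 1)
make_plot_label(toy, 2)
```

Using switch with an explicit default makes an unexpected class level fail loudly instead of silently falling into the second branch, as a two-way ifelse would.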
What is the best practice for writing this kind of code? It seems clear to me, but because I'm not an expert data scientist, I would like to know your working approach.
This thread is about this problem. But I think that my question covers both points of view:
Thanks
You asked this a while ago, but I'll attempt an answer in case anyone else is wondering the same thing. First, I like to split up my task and then loop over each part. This gives me more control over the process.
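# Split on every class/group combination; list() (not c()) keeps the two
# factors separate, and drop = TRUE omits combinations with no rows.
parts <- split(df, list(df$class, df$group), drop = TRUE)
mclapply(parts, some_function)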
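As a runnable illustration of the split-then-loop pattern, the sketch below uses a made-up data frame and a hypothetical summarise_part function in place of the real plotting code. The core count is an assumption that falls back to 1 on Windows, where forking is not available.

```r
library(parallel)

# Toy data: 2 class levels x 3 group levels, 2 rows each (made-up values).
df <- data.frame(
  class = rep(1:2, each = 6),
  group = rep(rep(1:3, each = 2), times = 2),
  value = seq_len(12)
)

# One list element per (class, group) combination; drop = TRUE omits empty ones.
parts <- split(df, list(df$class, df$group), drop = TRUE)

# Hypothetical stand-in for the plotting function.
summarise_part <- function(part) sum(part$value)

n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(parts, summarise_part, mc.cores = n_cores)
names(res)
```

Each element of parts is a small data frame holding one class/group combination, so the worker never has to filter the full data itself.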
Second, distributing tasks to multiple cores carries a lot of computational overhead, which can cancel out any gains you make from parallelizing your script. Here, mclapply splits the job across the cores you request via mc.cores and performs the fork only once. This is much more efficient than nesting two mclapply loops.
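To compare the two layouts on your own machine, a rough sketch like the following may help. The worker work, the grid built with expand.grid, and the core count are assumptions for illustration. Timings depend heavily on the hardware and on how expensive the real task is, so no expected numbers are given.

```r
library(parallel)

# Hypothetical cheap worker, mirroring the placeholder in the question.
work <- function(cl, g) if (cl == 1) "plotA" else "plotB"

n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Flat layout: one row per (class, group) pair, a single mclapply call.
grid <- expand.grid(class = 1:2, group = 1:500)
flat <- function() {
  mclapply(seq_len(nrow(grid)),
           function(i) work(grid$class[i], grid$group[i]),
           mc.cores = n_cores)
}

# Nested layout: the inner mclapply forks again for every class level.
nested <- function() {
  mclapply(1:2, function(cl) {
    mclapply(1:500, function(g) work(cl, g), mc.cores = n_cores)
  }, mc.cores = 1)
}

system.time(flat_res <- flat())
system.time(nested_res <- nested())
```

Both layouts do the same 1000 units of work; the difference is only in how often worker processes are created.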