How to write efficient nested functions for parallelization?

I have a data frame with two grouping variables, class and group. For each class, I have one plotting task per group. Typically there are 2 class levels and 500 group levels.

I'm using the parallel package for parallelization and its mclapply function to iterate over the class and group levels.

I'm wondering which is the best way to write my iterations. I think I have two options:

  1. Run the parallelization over the class variable.
  2. Run the parallelization over the group variable.

My computer has 3 cores available for the R session; I usually reserve the 4th core for the operating system. If I parallelize over the class variable, which has only 2 levels, the 3rd core will never be used, so I thought it would be more efficient to parallelize over the group variable and keep all 3 cores busy. I wrote a speed test to find out which way is best:

library(microbenchmark)
library(parallel)

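# A = cores for the outer (class) loop, B = cores for the inner (group) loop
f = function(class, group, A, B) {

  mclapply(seq_len(class), mc.cores = A, function(z) {
    mclapply(seq_len(group), mc.cores = B, function(c) {
      # branch on the current class level z
      ifelse(z == 1, 'plotA', 'plotB')
    })
  })

}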

class = 2
group = 500

microbenchmark(
  up = f(class, group, 3, 1),
  nest = f(class, group, 1, 3),
  times = 50L
)

Unit: milliseconds
 expr       min        lq     mean    median       uq      max neval
   up  6.751193  7.897118 10.89985  9.769894 12.26880 26.87811    50
 nest 16.584382 18.999863 25.54437 22.293591 28.60268 63.49878    50

The result says that I should parallelize over class, not over group.

The takeaway would be that I should always write single-core functions and then call them in parallel. I think my code would be simpler this way than if I wrote nested functions with built-in parallelization.
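A minimal sketch of that idea, assuming a hypothetical single-core worker called plot_one (not part of my real code) and a flat table of tasks built with expand.grid. The worker knows nothing about parallelization; one mclapply call over all class/group pairs does the distribution, so neither 2 nor 500 levels limits how many cores stay busy.

```r
library(parallel)

# Hypothetical single-core worker: handles one (class, group) pair.
plot_one <- function(class_level, group_level) {
  # Shared data preparation would go here; only the final step differs.
  if (class_level == 1) "plotA" else "plotB"
}

# One row per task: 2 class levels x 500 group levels = 1000 tasks.
tasks <- expand.grid(class = 1:2, group = 1:500)

# Forking is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 3L

# A single, flat parallel loop over all tasks.
results <- mclapply(
  seq_len(nrow(tasks)),
  function(i) plot_one(tasks$class[i], tasks$group[i]),
  mc.cores = n_cores
)
```

Whether this beats either nested variant on a given machine would still need to be measured; the sketch only shows the structure.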

The ifelse condition is there because the code that prepares the data for the plotting task is largely the same for both class levels, so I thought it would take fewer lines of code to write one longer function that checks which class level is in use than to split it into two shorter functions.

What is the best practice for writing this kind of code? It seems clear to me, but since I'm not an expert data scientist, I would like to know your working approach.

This thread deals with a similar problem, but I think my question concerns both points of view:

  • Code clarity and elegance
  • Speed performance

Thanks

You asked this a while ago, but I'll attempt an answer in case anyone else is wondering the same thing. First, I like to split up the task and then loop over each part. This gives me more control over the process.

# one data frame per class/group combination
parts <- split(df, list(df$class, df$group), drop = TRUE)
mclapply(parts, some_function)

Second, distributing tasks across multiple cores carries a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Here, mclapply splits the job across the cores you give it (mc.cores) and performs the fork only once. This is much more efficient than nesting two mclapply loops.
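To make the split-then-loop approach concrete, here is a rough sketch on a made-up data frame. The columns, the toy sizes, and the summarise_part function are placeholders for illustration, not the asker's real data or plotting code.

```r
library(parallel)

# Toy data: 2 classes x 5 groups, 3 rows each (made up for illustration).
df <- expand.grid(class = c("a", "b"), group = 1:5, rep = 1:3)
df$value <- seq_len(nrow(df))

# One data frame per class/group combination;
# drop = TRUE skips combinations that have no rows.
parts <- split(df, list(df$class, df$group), drop = TRUE)

# Hypothetical per-part task standing in for the plotting code.
summarise_part <- function(part) {
  data.frame(class = part$class[1], group = part$group[1], n = nrow(part))
}

# Forking is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 3L

# One flat mclapply call over all parts, no nesting.
out <- mclapply(parts, summarise_part, mc.cores = n_cores)
```

The names of parts combine both grouping variables (for example "a.1"), so each result can still be traced back to its class and group.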

