
When (if ever) should I tell R parallel to not use all cores?

I've been using this code:

library(parallel)
cl <- makeCluster( detectCores() - 1)
clusterCall(cl, function(){library(imager)})

Then I have a wrapper function looking something like this:

d <- matrix(...)  # loading a batch of data into a matrix
res <- parApply(cl, d, 1, FUN, ...)
# Upload `res` somewhere

I tested on my notebook, which has 8 logical cores (4 physical cores with hyperthreading). When I ran it on a 50,000-row, 800-column matrix, it took 177.5s to complete, and for most of that time the 7 cores were kept at near 100% (according to top); then it sat there for the last 15 or so seconds, which I guess was combining results. According to system.time(), user time was 14s, so that matches.

Now I'm running on EC2, a 36-core c4.8xlarge, and I'm seeing it spend almost all of its time with just one core at 100%. More precisely: there is an approx. 10-20 second burst where all cores are being used, then about 90 seconds with just one core at 100% (being used by R), then about 45 seconds of other stuff (where I save results and load the next batch of data). I'm doing batches of 40,000 rows by 800 columns.

The long-term load average, according to top, is hovering around 5.00.

Does this seem reasonable? Or is there a point where R parallelism spends more time on communication overhead than on useful work, so that I should be limiting it to, say, 16 cores? Any rules of thumb here?

Ref: the CPU spec. I'm using "Linux 4.4.5-15.26.amzn1.x86_64 (amd64)", R version 3.2.2 (2015-08-14).

UPDATE: I tried with 16 cores. For the smallest data, run-time increased from 13.9s to 18.3s. For the medium-sized data:

With 16 cores:
   user  system elapsed 
 30.424   0.580  60.034 

With 35 cores:
   user  system elapsed 
 30.220   0.604  54.395 

I.e. the overhead part took the same amount of time, but the parallel part had fewer cores, so it took longer, and therefore the job took longer overall.
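For completeness, this kind of comparison can be scripted rather than run by hand. A minimal sketch, reusing the d and FUN placeholders from the code above (the candidate core counts are just examples, and any extra arguments to FUN are omitted):

library(parallel)

# Hypothetical helper: time one batch on a fresh cluster of a given size.
time_with_cores <- function(n_cores, d, f) {
  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl), add = TRUE)
  clusterCall(cl, function() library(imager))
  system.time(parApply(cl, d, 1, f))[["elapsed"]]
}

# Try a few cluster sizes on the same batch and compare elapsed times.
sizes <- c(8, 16, 24, 35)
timings <- sapply(sizes, function(n) time_with_cores(n, d, FUN))
names(timings) <- sizes
timings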

I also tried using mclapply(), as suggested in the comments. It did appear to be a bit quicker (something like 330s vs. 360s on the particular test data I tried it on), but that was on my notebook, where other processes, or over-heating, could affect the results. So I'm not drawing any conclusions on that yet.
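For reference, a fork-based version along those lines might look like the sketch below. It reuses the d and FUN placeholders; how the results are combined (rbind here) depends on what FUN actually returns, and mclapply() forks, so it is not available on Windows:

library(parallel)
library(imager)

# Fork-based alternative: the workers inherit d, FUN and the loaded
# packages from the master process, so no cluster setup is needed.
res_list <- mclapply(seq_len(nrow(d)),
                     function(i) FUN(d[i, ]),
                     mc.cores = detectCores() - 1)

# Combining depends on what FUN returns; rbind is just one option.
res <- do.call(rbind, res_list)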

There are no useful rules of thumb: the number of cores for which a parallel task is optimal is entirely determined by that task. For a more general discussion, see Gustafson's law.

The long single-core stretch that you're seeing probably comes from the end phase of the algorithm (the "join" phase), where the parallel results are collated into a single data structure. Since this far surpasses the parallel computation phase, it may indeed be an indication that fewer cores could be beneficial.
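One thing that sometimes helps is to make the master's share of the join smaller: send each worker one large block of rows and let it pre-combine its own results, so the master only collates a handful of blocks at the end. A sketch, reusing the placeholders from the question and assuming FUN returns something that can be rbind-ed:

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterCall(cl, function() library(imager))
clusterExport(cl, "FUN")   # make the worker function available on the nodes

# One block of rows per worker, instead of one task per row.
blocks <- lapply(clusterSplit(cl, seq_len(nrow(d))),
                 function(idx) d[idx, , drop = FALSE])

res_blocks <- parLapply(cl, blocks, function(block) {
  # Apply FUN row by row inside the worker and combine locally, so the
  # master is left with only a handful of blocks to rbind.
  do.call(rbind, lapply(seq_len(nrow(block)), function(i) FUN(block[i, ])))
})

res <- do.call(rbind, res_blocks)
stopCluster(cl)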

I'd add that, in case you are not aware of this wonderful resource for parallel computing in R, you may find Norman Matloff's recent book Parallel Computing for Data Science: With Examples in R, C++ and CUDA a very helpful read. I'd highly recommend it (I learnt a lot, not coming from a CS background).

The book answers your question in depth (Chapter 2 specifically). It gives a high-level overview of the causes of overhead that lead to bottlenecks in parallel programs.

Quoting section 2.1, which implicitly and partially answers your question:

There are two main performance issues in parallel programming:

Communications overhead: Typically data must be transferred back and forth between processes. This takes time, which can take quite a toll on performance. In addition, the processes can get in each other's way if they all try to access the same data at once. They can collide when trying to access the same communications channel, the same memory module, and so on. This is another sap on speed. The term granularity is used to refer, roughly, to the ratio of computation to overhead. Large-grained or coarse-grained algorithms involve large enough chunks of computation that the overhead isn't much of a problem. In fine-grained algorithms, we really need to avoid overhead as much as possible.

^ When overhead is high, fewer cores for the problem at hand can give a shorter total computation time (the toy sketch after the quoted list illustrates this granularity point).

Load balance: As noted in the last chapter, if we are not careful in the way in which we assign work to processes, we risk assigning much more work to some than to others. This compromises performance, as it leaves some processes unproductive at the end of the run, while there is still work to be done.
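To make the granularity point concrete, here is a toy, self-contained illustration (it has nothing to do with the imager workload): with a trivially cheap per-element function, shipping one task per element is dominated by socket round-trips, while one block per worker is not.

library(parallel)

x <- runif(1e4)
cl <- makeCluster(4)

# Fine-grained: clusterApply() ships one tiny task per element, so the
# communication dominates the (negligible) computation.
t_fine <- system.time(clusterApply(cl, x, sqrt))[["elapsed"]]

# Coarse-grained: one block per worker, one round-trip each.
t_coarse <- system.time(parLapply(cl, clusterSplit(cl, x), sqrt))[["elapsed"]]

# Serial baseline for comparison.
t_serial <- system.time(sqrt(x))[["elapsed"]]

stopCluster(cl)
c(fine = t_fine, coarse = t_coarse, serial = t_serial)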

When, if ever, should you not use all cores? One example from my personal experience: I run daily cron jobs in R on data that amounts to 100-200GB in RAM, with multiple cores used to crunch blocks of data, and I've indeed found that running with, say, 6 out of 32 available cores was faster than using 20-30 of them. A major reason was the memory requirement of the child processes (once a certain number of child processes were in action, memory usage got high and things slowed down considerably).
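In that situation the core count ends up being chosen by a memory budget rather than by the CPU count. A sketch of that sizing rule; work_items, process_block and both GB figures are hypothetical placeholders, not measurements:

library(parallel)

# Cap the number of workers so the estimated per-worker footprint fits
# in the memory you can spare; both figures are assumed placeholders.
mem_budget_gb <- 100   # RAM you are willing to devote to the workers
per_worker_gb <- 15    # observed peak footprint of one child process

n_workers <- max(1, min(detectCores() - 1,
                        floor(mem_budget_gb / per_worker_gb)))

# work_items and process_block stand in for the real job's data and code.
res <- mclapply(work_items, process_block, mc.cores = n_workers)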
