
Parallel recursive function in R?

I've been wracking my brain over this problem all week and could really use an outside perspective. Basically, I've built a recursive tree function where the output of each node in one layer is used as the input for a node in the next layer. I've put together a toy example here in which each call generates a large matrix, splits it into submatrices, and then passes those submatrices on to subsequent calls. The key difference from similar questions on Stack is that each call of tree_search doesn't actually return anything; it just appends results to a CSV file.

Now I'd like to parallelize this function. However, when I run it with mclapply and mc.cores = 2, the runtime increases! The same happens when I run it on a multicore cluster with mc.cores = 12. What's going on here? Are the parent nodes waiting for the child nodes to return some output? Does this have something to do with fork/socket parallelization?

For background, this is part of an algorithm that models gene activation in white blood cells in response to viral infection. I'm a biologist and a self-taught programmer, so I'm a little out of my depth here - any help or leads would be really appreciated!

# Load libraries.
library(data.table)
library(parallel)

# Recursive tree search function.
tree_search <- function(submx = NA, loop = 0) {

  # Terminate on fifth loop.
  message(paste("Started loop", loop))
  if(loop == 5) {return(TRUE)}

  # Create large matrix and do some operation.
  bigmx <- matrix(rnorm(10), 50000, 250)
  bigmx <- sin(bigmx^2)

  # Aggregate matrix and save output.
  agg <- colMeans(bigmx)
  append <- file.exists("output.csv")
  fwrite(t(agg), file = "output.csv", append = append, row.names = FALSE)

  # Split matrix in submatrices with 100 columns each.
  ind <- ceiling(seq_len(ncol(bigmx)) / 100)

  lapply(unique(ind), function(i) {

    submx <- bigmx[, ind == i]

    # Pass each submatrix to subsequent call.
    loop <- loop + 1
    tree_search(submx, loop) # sub matrix is used to generate big matrix in subsequent call (not shown)

  })

}

# Initiate tree search.
tree_search()
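For reference, this is roughly the parallel variant described above that ran slower (a minimal sketch, not the exact code: the matrix dimensions are shrunk and the CSV write is omitted so it runs quickly; the only structural change from the code above is swapping the inner lapply for mclapply):

```r
library(parallel)

# The slow variant: same recursion as above, but with the inner lapply
# swapped for mclapply.
tree_search_par <- function(submx = NA, loop = 0) {
  if (loop == 5) return(TRUE)

  # Shrunk stand-in for the large matrix.
  bigmx <- sin(matrix(rnorm(10), 500, 250)^2)

  # Split into submatrices with 100 columns each.
  ind <- ceiling(seq_len(ncol(bigmx)) / 100)

  # mclapply forks new workers at EVERY level of the recursion -- and
  # each parent blocks until all of its children return.
  mclapply(unique(ind), function(i) {
    tree_search_par(bigmx[, ind == i], loop + 1)
  }, mc.cores = 2)
}

res <- tree_search_par()
```

Each level of the tree pays the forking overhead again, which is part of why this gets slower rather than faster.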

After a lot more brain-wracking and experimentation, I ended up answering my own question. I'm not going to refer to the original example, since I've changed my approach quite a bit. Instead, I'll share some general observations that might help people in similar situations.

1.) For loops are more memory-efficient than lapply and recursive functions

When you use lapply, each call of your function runs in its own environment, so any variable you modify inside it becomes a separate local copy. That's why you can do this:

x <- 5
lapply(1:10, function(i) {
   x <- x + 1
   x == 6 # TRUE
})
x == 5 # ALSO TRUE

At the end, x is still 5, which means that each call inside lapply was manipulating its own local copy of x. That's not good if, say, x is actually a large data frame with 10,000 variables. A for loop, on the other hand, lets you overwrite the same variable on each iteration.

x <- 5
for(i in 1:10) {x <- x + 1}
x == 5 # FALSE (x is now 15)

2.) Parallelize once

Distributing tasks to different nodes carries a lot of computational overhead and can cancel out any gains you make from parallelizing your script. Therefore, use mclapply with discretion. In my case, that meant NOT putting mclapply inside a recursive function, where it would get called tens to hundreds of times. Instead, I split the starting point into 16 parts and ran 16 separate tree searches on separate nodes.
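In sketch form, the pattern looks like this (run_branch is a hypothetical placeholder for one complete, sequential tree search, and the 16-way split of 1:160 is just illustrative):

```r
library(parallel)

# Placeholder for one complete, sequential tree search over a single
# starting part -- the real body would be the full recursion, run
# entirely on one core with plain lapply or a for loop.
run_branch <- function(part) {
  sum(part)
}

# Split the starting point into 16 parts up front...
parts <- split(1:160, rep(1:16, each = 10))

# ...and parallelize exactly once, at the top level.
results <- mclapply(parts, run_branch, mc.cores = 4)
```

The forking overhead is paid once per branch instead of once per recursive call.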

3.) You can use mclapply to throttle memory usage

If you split a job into 10 parts and process them with mclapply and mc.preschedule = FALSE, each core will only process one part (10% of the job) at a time. If mc.cores is set to two, for example, the other 8 parts wait until one of the running parts finishes before a new one starts. This is useful if you are running into memory issues and want to prevent each loop from taking on more than it can handle.
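A minimal sketch of this throttling pattern (the job of 100 items and the 10-way split are placeholders):

```r
library(parallel)

# Ten parts, but with mc.preschedule = FALSE and mc.cores = 2, only two
# parts are live in memory at any moment: a fresh worker is forked per
# part, and a new fork starts only when a running worker finishes and
# releases its memory.
parts <- split(seq_len(100), rep(1:10, each = 10))
out <- mclapply(parts, function(p) mean(p),
                mc.cores = 2, mc.preschedule = FALSE)
```

With the default mc.preschedule = TRUE, the 10 parts would instead be divided among the cores up front, so each worker holds several parts' worth of data at once.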

Final Note

This is one of the more interesting problems I've worked on so far. However, recursive tree functions are complicated. Draw out the algorithm, and force yourself to spend a few days away from your code so that you can come back with a fresh perspective.
