Is there a way to track progress on mclapply?

Question

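I love the setting .progress = 'text' in plyr's llply. However, it causes me much anxiety not to know how far along an mclapply (from package multicore) is, since list items are sent to various cores and then collated at the end.

I've been outputting messages like *currently in sim_id # ....*, but that's not very helpful because it doesn't tell me what percentage of list items are complete (although it is helpful to know that my script isn't stuck and is moving along).

Can someone suggest other ideas that would allow me to look at my .Rout file and get a sense of progress? I've thought about adding a manual counter, but I can't see how I would implement it, since mclapply must finish processing all list items before it can give out any feedback.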

Answer 1

Because mclapply spawns multiple processes, one might want to use fifos, pipes, or even sockets. Now consider the following example:

library(multicore)

finalResult <- local({
    f <- fifo(tempfile(), open="w+b", blocking=TRUE)
    if (inherits(fork(), "masterProcess")) {
        # Child
        progress <- 0.0
        while (progress < 1 && !isIncomplete(f)) {
            msg <- readBin(f, "double")
            progress <- progress + as.numeric(msg)
            cat(sprintf("Progress: %.2f%%\n", progress * 100))
        } 
        exit()
    }
    numJobs <- 100
    result <- mclapply(1:numJobs, function(...) {
        # Do something fancy here
        # ...
        # Send some progress update
        writeBin(1/numJobs, f)
        # Some arbitrary result
        sample(1000, 1)
    })
    close(f)
    result
})

cat("Done\n")

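Here, a temporary file is used as the fifo, and the main process forks a child whose only duty is to report the current progress. The main process continues by calling mclapply, where the expression (more precisely, the expression block) to be evaluated writes partial progress information to the fifo buffer by means of writeBin.

As this is only a simple example, you'll probably have to adapt the whole output side to your needs. HTH!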
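To look at the fifo transport on its own, here is a minimal single-process sketch. It is not part of the answer above: it only writes a few progress increments into a fifo with writeBin and reads them back with readBin, assuming a platform where capabilities("fifo") is TRUE (typically Unix-alikes). The names path and numJobs are illustrative.

```r
# Single-process illustration of the fifo transport (no forking involved)
progress <- 0
if (capabilities("fifo")) {
  path <- tempfile()
  f <- fifo(path, open = "w+b", blocking = TRUE)

  numJobs <- 4
  # Writer side: each "job" reports its share of the total work
  for (i in seq_len(numJobs)) writeBin(1 / numJobs, f)

  # Reader side: accumulate the increments and print a percentage
  for (i in seq_len(numJobs)) {
    progress <- progress + readBin(f, "double")
    cat(sprintf("Progress: %.2f%%\n", progress * 100))
  }

  close(f)
  unlink(path)
}
```

In the real setup the writes happen inside the mclapply workers and the reads in the forked reporter process; only the transport is the same.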

Answer 2

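Essentially another version of @fotNelton's solution, with some modifications:

Drop-in replacement for mclapply (supports all mclapply functionality)
Catches Ctrl-C and aborts gracefully
Uses the built-in progress bar (txtProgressBar)
Option to track progress or not, and to choose the style of the progress bar
Uses parallel rather than multicore, which has since been removed from CRAN
Coerces X to a list as mclapply does (so length(X) gives the expected result)
roxygen2-style documentation at the top

Hope this helps someone...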

library(parallel)

#-------------------------------------------------------------------------------
#' Wrapper around mclapply to track progress
#' 
#' Based on http://stackoverflow.com/questions/10984556
#' 
#' @param X         a vector (atomic or list) or an expressions vector. Other
#'                  objects (including classed objects) will be coerced by
#'                  ‘as.list’
#' @param FUN       the function to be applied to each element of ‘X’
#' @param ...       optional arguments to ‘FUN’
#' @param mc.preschedule see mclapply
#' @param mc.set.seed see mclapply
#' @param mc.silent see mclapply
#' @param mc.cores see mclapply
#' @param mc.cleanup see mclapply
#' @param mc.allow.recursive see mclapply
#' @param mc.progress track progress?
#' @param mc.style    style of progress bar (see txtProgressBar)
#'
#' @examples
#' x <- mclapply2(1:1000, function(i, y) Sys.sleep(0.01))
#' x <- mclapply2(1:3, function(i, y) Sys.sleep(1), mc.cores=1)
#' 
#' dat <- lapply(1:10, function(x) rnorm(100)) 
#' func <- function(x, arg1) mean(x)/arg1 
#' mclapply2(dat, func, arg1=10, mc.cores=2)
#-------------------------------------------------------------------------------
mclapply2 <- function(X, FUN, ..., 
    mc.preschedule = TRUE, mc.set.seed = TRUE,
    mc.silent = FALSE, mc.cores = getOption("mc.cores", 2L),
    mc.cleanup = TRUE, mc.allow.recursive = TRUE,
    mc.progress=TRUE, mc.style=3) 
{
    if (!is.vector(X) || is.object(X)) X <- as.list(X)

    if (mc.progress) {
        f <- fifo(tempfile(), open="w+b", blocking=TRUE)
        p <- parallel:::mcfork()
        pb <- txtProgressBar(0, length(X), style=mc.style)
        setTxtProgressBar(pb, 0) 
        progress <- 0
        if (inherits(p, "masterProcess")) {
            while (progress < length(X)) {
                readBin(f, "double")
                progress <- progress + 1
                setTxtProgressBar(pb, progress) 
            }
            cat("\n")
            parallel:::mcexit()
        }
    }
    tryCatch({
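        # The wrapper is passed as FUN; extra arguments follow it so that
        # unnamed ones are forwarded to FUN instead of being taken as FUN.
        result <- mclapply(X, function(...) {
                res <- FUN(...)
                if (mc.progress) writeBin(1, f)
                res
            }, ...,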
            mc.preschedule = mc.preschedule, mc.set.seed = mc.set.seed,
            mc.silent = mc.silent, mc.cores = mc.cores,
            mc.cleanup = mc.cleanup, mc.allow.recursive = mc.allow.recursive
        )

    }, finally = {
        if (mc.progress) close(f)
    })
    result
}
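The graceful-abort behaviour of the wrapper above rests on tryCatch(..., finally = ...), which runs its cleanup whether the body finishes, fails, or is interrupted. The following is a small stand-alone sketch of that pattern, not taken from the answer; it uses an ordinary file connection and a simulated error in place of the fifo and a real Ctrl-C, and the names path, con, and outcome are illustrative.

```r
path <- tempfile()
con <- file(path, open = "w")

outcome <- tryCatch({
  writeLines("partial progress", con)
  stop("simulated failure")   # stands in for an error or interrupt mid-run
}, error = function(e) {
  conditionMessage(e)         # return the message instead of aborting
}, finally = {
  close(con)                  # cleanup runs on success and on failure alike
})

saved <- readLines(path)      # what reached the file before the failure
unlink(path)

print(outcome)
print(saved)
```

In mclapply2 the same finally clause closes the fifo, so an aborted run should not leave the connection open.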

Answer 3

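The pbapply package has implemented this for the general case (i.e. on Unix-like systems and Windows, and it also works with RStudio). Both pblapply and pbsapply have a cl argument. From the documentation:

Parallel processing can be enabled through the cl argument. parLapply is called when cl is a "cluster" object, mclapply is called when cl is an integer. Showing the progress bar increases the communication overhead between the main process and nodes / child processes compared to the parallel equivalents of the functions without the progress bar. The functions fall back to their original equivalents when the progress bar is disabled (i.e. getOption("pboptions")$type == "none" or dopb() is FALSE). This is the default when interactive() is FALSE (i.e. called from a command-line R script).

If you don't supply cl (or pass NULL), the default non-parallel lapply is used, also with a progress bar.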
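A minimal sketch of the cl argument in use, assuming the pbapply package is installed and a Unix-like system where forking is available. The function slow_square is a made-up placeholder workload.

```r
library(pbapply)

# Placeholder workload: sleep briefly, then square the input
slow_square <- function(i) {
  Sys.sleep(0.05)
  i^2
}

# An integer cl requests forked workers; the progress bar is drawn by the
# main process as results come back
res <- pblapply(1:20, slow_square, cl = 2L)

# The same call without cl runs sequentially, still with a progress bar
res_seq <- pblapply(1:20, slow_square)
```

When run non-interactively (for example via Rscript), the quoted documentation suggests the bar is disabled by default, so the pboptions settings may need adjusting if progress should appear in a log file.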

Answer 4

Here's a function based on @fotNelton's solution that can be used wherever you would normally use mclapply.

mcadply <- function(X, FUN, ...) {
  # Runs multicore lapply with progress indicator and transformation to
  # data.table output. Arguments mirror those passed to lapply.
  #
  # Args:
  # X:   Vector.
  # FUN: Function to apply to each value of X. Note this is transformed to 
  #      a data.frame return if necessary.
  # ...: Other arguments passed to mclapply.
  #
  # Returns:
  #   data.table stack of each mclapply return value
  #
  # Progress bar code based on https://stackoverflow.com/a/10993589
  require(multicore)
  require(plyr)
  require(data.table)

  local({
    f <- fifo(tempfile(), open="w+b", blocking=TRUE)
    if (inherits(fork(), "masterProcess")) {
      # Child
      progress <- 0
      print.progress <- 0
      while (progress < 1 && !isIncomplete(f)) {
        msg <- readBin(f, "double")
        progress <- progress + as.numeric(msg)
        # Print every 1%
        if(progress >= print.progress + 0.01) {
          cat(sprintf("Progress: %.0f%%\n", progress * 100))
          print.progress <- floor(progress * 100) / 100
        }
      }
      exit()
    }

    newFun <- function(...) {
      writeBin(1 / length(X), f)
      return(as.data.frame(FUN(...)))
    }

    result <- as.data.table(rbind.fill(mclapply(X, newFun, ...)))
    close(f)
    cat("Done\n")
    return(result)
  })
}
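The "print every 1%" throttle in the function above accumulates floating-point fractions. As a variation (not the answer's code), the same idea can be sketched with integer counts, which avoids rounding drift. The helper name report_every_percent is hypothetical, and it collects the messages in a vector rather than printing them so that the result is easy to inspect.

```r
# Return one message per whole-percent step while counting n finished jobs
report_every_percent <- function(n) {
  last_pct <- 0L
  lines <- character(0)
  for (done in seq_len(n)) {
    pct <- (100L * done) %/% n          # whole percent completed so far
    if (pct > last_pct) {               # only report when the percent changes
      lines <- c(lines, sprintf("Progress: %d%%", pct))
      last_pct <- pct
    }
  }
  lines
}

msgs <- report_every_percent(250L)
cat(head(msgs, 3), sep = "\n")
```

In a forked reporter the loop body would run once per value read from the fifo instead of once per loop index.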

Answer 5

You can use your system's echo command to write from your workers, so simply add the following line to your function:

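# dosomething() and mydata stand in for your own function and data
myfun <- function(x) {
    # echo every 5th index from inside the worker
    if (x %% 5 == 0) system(paste("echo 'now processing:", x, "'"))
    dosomething(mydata[x])
}

result <- mclapply(1:10, myfun, mc.cores=5)
# now processing: 5 
# now processing: 10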

This works if you pass an index: instead of passing a list of data, pass the indices and extract the data inside the worker function.
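A runnable sketch of the index-passing approach, assuming a Unix-like system (for forking and the shell's echo). The data in mydata and the sqrt() workload are placeholders. The percentage is derived from the index, so it is only a rough indicator, because items are not necessarily processed in index order across cores.

```r
library(parallel)

mydata <- as.list(seq(10, 100, by = 10))  # hypothetical input data
n <- length(mydata)

myfun <- function(i) {
  # Report every 5th index, with a rough percentage based on the index
  if (i %% 5 == 0) {
    system(sprintf("echo 'now processing: %d of %d (%.0f%%)'",
                   i, n, 100 * i / n))
  }
  sqrt(mydata[[i]])  # stand-in for the real per-item work
}

# Pass indices rather than the data itself
result <- mclapply(seq_len(n), myfun, mc.cores = 2)
```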

Answer 6

Based on @fotNelton's answer, using a progress bar instead of line-by-line printing and calling an external function with mclapply.

library('utils')
library('multicore')

# Worker function defined outside the mclapply call; it reaches the fifo
# through the global variable f assigned below.
.mcfunc <- function(i, ...) {
    writeBin(1, f)   # report one finished item
    return(i)
}

prog.indic <- local({ # evaluates in local environment only
    MC <- 100   # number of jobs; set before the progress bar is created
    f <- fifo(tempfile(), open="w+b", blocking=TRUE) # open fifo connection
    assign(x='f', value=f, envir=.GlobalEnv)
    pb <- txtProgressBar(min=0, max=MC, style=3)

    if (inherits(fork(), "masterProcess")) { # progress tracker
        # Child
        progress <- 0.0
        while (progress < MC && !isIncomplete(f)) {
            msg <- readBin(f, "double")
            progress <- progress + as.numeric(msg)
            # Updating the progress bar.
            setTxtProgressBar(pb, progress)
        }
        exit()
    }
    result <- mclapply(1:MC, .mcfunc)

    cat('\n')
    assign(x='result', value=result, envir=.GlobalEnv)
    close(f)
})

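To use it from a function defined outside the mclapply call, the fifo connection has to be assigned to .GlobalEnv. Thanks for the question and the previous replies; I had been wondering how to do this for a while.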
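A possible alternative to the global assignment, offered as an untested-in-production sketch rather than as the answer's method, is to hand the connection to the external function as an extra argument. The example below uses plain lapply in a single process so that it runs without forking; the names worker, con, and done are illustrative, and it assumes capabilities("fifo") is TRUE.

```r
# External worker that receives the connection explicitly
worker <- function(i, con) {
  writeBin(1, con)   # report one finished item
  i
}

done <- 0
result <- NULL
if (capabilities("fifo")) {
  path <- tempfile()
  con <- fifo(path, open = "w+b", blocking = TRUE)

  # Extra arguments after the function are forwarded to every call
  result <- lapply(1:5, worker, con = con)

  # Read back the five progress markers written by the worker
  done <- sum(readBin(con, "double", n = 5))

  close(con)
  unlink(path)
}
```

Since mclapply forwards extra arguments in the same way as lapply, the same call shape may work there too, but that is an assumption to verify on your own setup.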

6 answers

Answer 1: score 26, accepted, 2012-06-12 09:11:17
Answer 2: score 15, 2014-11-12 17:31:32
Answer 3: score 12, 2016-12-07 10:23:58
Answer 4: score 7, 2013-10-29 06:19:53
Answer 5: score 3, 2019-05-10 12:13:43
Answer 6: score 2, 2013-11-05 15:36:54
