
Run an R function with multiple parameters in parallel mode

I have the function:

function1 <- function(df1, df2, int1, int2, char1)
{
...
return(newDataFrame)
}

which takes 5 inputs: the first two are data frames, followed by two integers and a string. The function returns a new data frame.

So far I am running this function 8 times sequentially:

newDataFrame1 <- function1(df1, df2, 1, 1, "someString")
newDataFrame2 <- function1(df1, df2, 2, 0, "someString")
newDataFrame3 <- function1(df1, df2, 3, 0, "someString")
newDataFrame4 <- function1(df1, df2, 4, 0, "someString")
newDataFrame5 <- function1(df1, df2, 5, 0, "someString")
newDataFrame6 <- function1(df1, df2, 6, 0, "someString")
newDataFrame7 <- function1(df1, df2, 7, 0, "someString")
newDataFrame8 <- function1(df1, df2, 8, 0, "someString")

and at the end I combine the results using rbind():

newDataFrameTot <-  rbind(newDataFrame1, newDataFrame2, newDataFrame3, newDataFrame4, newDataFrame5, newDataFrame6, newDataFrame7, newDataFrame8)

I wanted to run this in parallel using library(parallel), but I can't figure out how to make it work. I am trying:

cluster <- makeCluster(detectCores())
result <- clusterApply(cluster,1:8,function1)
newDataFrameTot <- do.call(rbind,result)

but this doesn't work unless function1() has only one parameter, which I loop over from 1 to 8. That is not my case, since I need to pass 5 inputs. How can I make this work in parallel?

To pass one variable, you would use the parallel version of lapply or sapply, as you tried. However, to pass several variables, you have to use the parallel version of mapply or Map. That would be clusterMap, so try:

clusterMap(cluster, function1, df1, df2, 1:8, c(1, rep(0, 7)), "someString")

Edit: As pointed out in the comments, this will throw an error. Normally, arguments of length 1 (such as "someString" in this example) are recycled to the length of the other ones (e.g. 1:8 in this example). The error occurs because the data frames are not recycled in the same manner. They are treated as lists instead, so their columns are repeated rather than the whole data frame. This is why you got the error "$ operator is invalid for atomic vectors": inside function1, the code attempted to use $ on an extracted column of a data frame, which is a vector, rather than on the data frame itself.

There are two remedies. The first is to pass the additional arguments inside MoreArgs, as mentioned in the other answer. This requires your arguments to be named (which is good practice anyway). The second is to wrap each data frame in a list:

clusterMap(cluster, function1, list(df1), list(df2), 1:8, c(1, rep(0, 7)), "someString")

This will work, because now the whole data frames df1 and df2 are recycled. The difference can be seen, for example, by comparing the output of rep(df1, 2) with that of rep(list(df1), 2).

To iterate over more than one variable, clusterMap is very useful. Since you're only iterating over int1 and int2, you should use the "MoreArgs" option to specify the variables that you aren't iterating over:
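The recycling difference described above can be checked locally without a cluster. The following is a minimal sketch: the contents of df1 are made up for illustration, and Map() is used as a stand-in because clusterMap() recycles its arguments in the same way.

```r
# Toy data frame standing in for df1 (hypothetical contents).
df1 <- data.frame(a = 1:3, b = c("x", "y", "z"))

# A data frame is a list of columns, so rep() repeats the columns.
cols <- rep(df1, 2)
length(cols)               # 4 elements: a, b, a, b
is.data.frame(cols[[1]])   # FALSE: each element is a single column

# Wrapped in a list, the whole data frame becomes the unit that is repeated.
whole <- rep(list(df1), 2)
length(whole)              # 2 elements
is.data.frame(whole[[1]])  # TRUE

# Map() shows what the mapped function actually receives in each case.
seen_bare    <- Map(function(d, i) class(d)[1], df1, 1:2)        # column classes
seen_wrapped <- Map(function(d, i) class(d)[1], list(df1), 1:2)  # "data.frame" twice
```

In the bare case the function is handed one column per call, which is exactly the situation that triggers the "$ operator is invalid for atomic vectors" error.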

cluster <- makeCluster(detectCores())
clusterEvalQ(cluster, library(xts))
result <- clusterMap(cluster, function1, int1=1:8, int2=c(1, rep(0, 7)),
                MoreArgs=list(df1=df1, df2=df2, char1="someString"))
df <- do.call('rbind', result)

In particular, if df1 and df2 are data frames and they are specified as iteration variables rather than through "MoreArgs", clusterMap will iterate over the columns of those data frames rather than passing the entire data frame to function1, which isn't what you want.

Note that it's important to use named arguments so that the arguments are passed correctly.
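For readers who want to try the MoreArgs approach end to end, here is a small self-contained sketch. The body of function1 and the contents of df1 and df2 are hypothetical stand-ins (the original function body is not shown), and the cluster is limited to two workers to keep the demo light.

```r
library(parallel)

# Hypothetical stand-in for function1: combines the inputs into a data frame.
function1 <- function(df1, df2, int1, int2, char1) {
  data.frame(id = df1$id, total = df1$x + df2$y * int1 + int2, label = char1)
}

# Made-up inputs for illustration only.
df1 <- data.frame(id = 1:3, x = c(10, 20, 30))
df2 <- data.frame(id = 1:3, y = c(1, 2, 3))

cluster <- makeCluster(2)

# int1 and int2 are iterated over in parallel; everything in MoreArgs
# is passed unchanged to every call.
result <- clusterMap(cluster, function1,
                     int1 = 1:8, int2 = c(1, rep(0, 7)),
                     MoreArgs = list(df1 = df1, df2 = df2, char1 = "someString"))

stopCluster(cluster)  # release the worker processes when done

newDataFrameTot <- do.call(rbind, result)
```

Because every argument is named, the order in which they are listed does not matter, and the result should match what the eight sequential calls followed by rbind() produce.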


A Note on Performance

If either df1 or df2 is very large, you may get better performance by exporting them to the cluster workers. This avoids sending them with every task, but requires a wrapper function. It also means that you no longer need to use the "MoreArgs" option:

clusterExport(cluster, c('df1', 'df2', 'function1'))
wrapper <- function(int1, int2, char1) {
  function1(df1, df2, int1, int2, char1)
}
result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")

This allows df1 and df2 to be reused if the workers perform multiple tasks, but is pointless if the number of tasks is equal to the number of workers.

As I had the same problem recently in R, I am attaching a link to a very useful website. It covers the new multidplyr package, which enables parallel processing in R. It definitely works on Windows 10. :)
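A runnable sketch of the export-based variant follows, again with a hypothetical function1 and made-up data. It additionally uses clusterEvalQ() to confirm that the exported objects exist on each worker, and passes envir = environment() to clusterExport() so the snippet also works when it is not run at top level; both are additions for illustration.

```r
library(parallel)

# Hypothetical stand-in for function1 and made-up data frames.
function1 <- function(df1, df2, int1, int2, char1) {
  data.frame(id = df1$id, total = df1$x + df2$y * int1 + int2, label = char1)
}
df1 <- data.frame(id = 1:3, x = c(10, 20, 30))
df2 <- data.frame(id = 1:3, y = c(1, 2, 3))

cluster <- makeCluster(2)  # 2 workers, 8 tasks: each worker reuses the data

# Copy the data and the function to every worker once, up front.
clusterExport(cluster, c("df1", "df2", "function1"), envir = environment())

# One logical value per worker: are the exported data frames present there?
on_workers <- unlist(clusterEvalQ(cluster, exists("df1") && exists("df2")))

# The wrapper only takes the small per-task arguments.
wrapper <- function(int1, int2, char1) {
  function1(df1, df2, int1, int2, char1)
}
result <- clusterMap(cluster, wrapper, 1:8, c(1, rep(0, 7)), "someString")

stopCluster(cluster)
newDataFrameTot <- do.call(rbind, result)
```

With two workers and eight tasks, each worker handles several tasks, which is the situation where exporting the data once can pay off.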

http://www.business-science.io/code-tools/2016/12/18/multidplyr.html

To help you with your code, this is the solution I would propose (I did not test it, but it should work, as I used the same approach on another example):

#Install the packages
install.packages("devtools")
devtools::install_github("hadley/multidplyr")
require(multidplyr)
library(parallel)
cl <- detectCores()
cluster <- create_cluster(cores = cl)
cluster %>%
    # Assign libraries
    cluster_library("igraph") %>%
    cluster_library("tidyverse") %>%
    cluster_library("magrittr") %>%
    cluster_library("dplyr") %>%
    cluster_library("RColorBrewer") %>%
    # Assign values (use this to load functions or data to each core)
    cluster_assign_value("anyfunction", anyfunction)

result <- clusterMap(cluster, function1, int1=1:8, int2=c(1, rep(0, 7)),
            MoreArgs=list(df1=df1, df2=df2, char1="someString"))
