Parallelize user-defined function using apply family in R

I have a script that takes too long to compute, and I'm trying to parallelize its execution.

The script basically loops through each row of a data frame and performs some calculations, as shown below:

my.df = data.frame(id=1:9,value=11:19)

sumPrevious <- function(df,df.id){
    sum(df[df$id<=df.id,"value"])
}

for(i in 1:nrow(my.df)){
    print(sumPrevious(my.df,my.df[i,"id"]))
}

I'm just starting to learn how to parallelize code in R, so I first want to understand how I could do this with an apply-like function (e.g. sapply, lapply, mapply).

I've tried multiple things, but nothing has worked so far:

mapply(sumPrevious,my.df,my.df$id) # Error in df$id : $ operator is invalid for atomic vectors

Using the parallel package in R, you can use the mclapply() function. You will need to adjust your code a little to make it run in parallel.

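library(parallel)
my.df = data.frame(id=1:9,value=11:19)

# Number of worker processes (assumed here: every core detectCores() reports)
no.of.cores = detectCores()

# The row index i comes first because mclapply passes each element of X as the first argument
sumPrevious <- function(i,df){
    df.id = df$id[i]
    sum(df[df$id<=df.id,"value"])
}

mclapply(X = 1:nrow(my.df),FUN = sumPrevious,my.df,mc.preschedule = TRUE,mc.cores = no.of.cores)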

This code runs sumPrevious in parallel on no.of.cores cores of your machine.
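
mclapply() relies on forking, so on Windows it cannot use more than one core. As a minimal sketch of an alternative, the same computation can be run on a socket cluster with parLapply() from the same parallel package. The choice of two workers is an assumption made for illustration, and sumPrevious is the row-index version of the function.

```r
library(parallel)

my.df <- data.frame(id = 1:9, value = 11:19)

# Row-index version: i is the row whose id sets the cut-off
sumPrevious <- function(i, df) {
    df.id <- df$id[i]
    sum(df[df$id <= df.id, "value"])
}

# Two workers is an arbitrary choice for this sketch
cl <- makeCluster(2)

# Arguments after the function are forwarded to every call,
# so the data frame is shipped to the workers along with the function
res <- parLapply(cl, seq_len(nrow(my.df)), sumPrevious, df = my.df)

stopCluster(cl)

unlist(res)
# 11 23 36 50 65 81 98 116 135
```

For a task this small, the cost of starting the workers and copying the data will most likely outweigh any gain; parallel execution tends to pay off only when each call does a substantial amount of work.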

Well, this is fun to play with. You need something like the following:

 mapply(sumPrevious,list(my.df),my.df$id)

For sapply, since the data frame is the first argument of sumPrevious, you have to wrap the call in an anonymous function that swaps the arguments, so that the id comes first:

  sapply(my.df$id,function(x,y) sumPrevious(y,x),my.df)

I prefer mapply here, since we can pass the data frame directly as the first argument. It has to be passed as a whole, though: mapply iterates over the elements of each argument, and the elements of a data frame are its columns, so passing my.df as-is hands sumPrevious one column at a time. That's why you have to wrap it in list().
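
To make the role of list() concrete, the sketch below puts the failing call and the working call side by side. It also shows MoreArgs, a standard mapply argument for values that should be passed unchanged to every call; that variant is an alternative added here for comparison, not part of the answer above.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(df, df.id) {
    sum(df[df$id <= df.id, "value"])
}

# A data frame is a list of columns, so mapply hands sumPrevious
# one column at a time; this is what triggers the "$ operator" error
failed <- tryCatch(mapply(sumPrevious, my.df, my.df$id),
                   error = function(e) "error")

# list(my.df) has length 1, so it is recycled whole for every id
a <- mapply(sumPrevious, list(my.df), my.df$id)

# MoreArgs passes the data frame unchanged to every call
b <- mapply(sumPrevious, df.id = my.df$id, MoreArgs = list(df = my.df))

a
# 11 23 36 50 65 81 98 116 135
```

Both working calls return the same vector; which one reads better is largely a matter of taste.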

Map is a wrapper around mapply that does not simplify its result, so it simply returns the solution as a list. Try it. Likewise, lapply is similar to sapply, except that sapply simplifies the results into a vector or array where possible, while lapply returns the same results as a list.
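
The difference in return types can be checked directly. The following is a small sketch using the data and function from the question; the variable names res.map, res.lapply and res.sapply are made up for this illustration.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(df, df.id) {
    sum(df[df$id <= df.id, "value"])
}

# Map and lapply always return a list, one element per id
res.map <- Map(sumPrevious, list(my.df), my.df$id)
res.lapply <- lapply(my.df$id, function(x) sumPrevious(my.df, x))

# sapply simplifies the same results into an atomic vector
res.sapply <- sapply(my.df$id, function(x) sumPrevious(my.df, x))

is.list(res.map)      # TRUE
is.list(res.lapply)   # TRUE
is.list(res.sapply)   # FALSE
```

Calling unlist() on either list gives the same vector that sapply returns.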

That said, it seems that what you are trying to do can be done simply with the cumsum function, provided the rows are sorted by id, as they are here:

 cumsum(my.df$value)
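
A quick check, as a sketch, that cumsum reproduces the loop for the example data. cumsum follows row order, whereas sumPrevious compares ids, so the two agree only when the rows are sorted by id and the ids are unique. The shuffled data frame and the order() step below are assumptions added for illustration.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(df, df.id) {
    sum(df[df$id <= df.id, "value"])
}

# Reference result from the apply-style approach
by.loop <- sapply(my.df$id, function(x) sumPrevious(my.df, x))

# Vectorised equivalent when the rows are already sorted by id
by.cumsum <- cumsum(my.df$value)

# Hypothetical unsorted case: sort by id, cumsum, then restore row order
shuffled <- my.df[c(3, 1, 2, 9, 8, 7, 4, 6, 5), ]
o <- order(shuffled$id)
by.cumsum.unsorted <- integer(nrow(shuffled))
by.cumsum.unsorted[o] <- cumsum(shuffled$value[o])

identical(by.loop, by.cumsum)
# TRUE
```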
