
How to apply function clusterApply to parallel computing?

I have a function sum_var that takes an integer as input and returns a real number as output. I checked this function on some inputs and it runs well.

I would like to use clusterApply to utilize my CPU (6 cores and 12 logical processors). I've tried to modify the code given in class:

library("parallel")
cl <- makeCluster(6)
res_par <- clusterApply(cl, 1:10000, fun = sum_var)

But it returns an error:

Error in checkForRemoteErrors(val) : 10000 nodes produced errors; first error: object 'df_simulate' not found

Could you please elaborate on how to achieve my goal? Below is the full code.

### Generate dataframe
n_simu <- 1000
set.seed(1)
df_simulate <- data.frame(x_1 = rnorm(n_simu))
for (k in 2:10000) {
  set.seed(k)
  df_simulate[, paste0("x_", k)] <- rnorm(n_simu)
}
df_simulate[, "y"] <- runif(n_simu, 0, 0.5)
df_simulate[df_simulate$x_40 > 0 & df_simulate$x_99 > 0.8, "y"] <-
  df_simulate[df_simulate$x_40 > 0 & df_simulate$x_99 > 0.8, "y"] + 5.75
df_simulate[df_simulate$x_40 > 0 & df_simulate$x_99 <= 0.8 & df_simulate$x_30 > 0.5, "y"] <-
  df_simulate[df_simulate$x_40 > 0 & df_simulate$x_99 <= 0.8 & df_simulate$x_30 > 0.5, "y"] + 18.95
df_simulate[df_simulate$x_40 > 0 & df_simulate$x_99 <= 0.8 & df_simulate$x_30 <= 0.5, "y"] <-
  df_simulate[df_simulate$x_40 > 0 & df_simulate$x_99 <= 0.8 & df_simulate$x_30 <= 0.5, "y"] + 20.55
df_simulate[df_simulate$x_40 <= 0 & df_simulate$x_150 < 0.5, "y"] <-
  df_simulate[df_simulate$x_40 <= 0 & df_simulate$x_150 < 0.5, "y"] - 5
df_simulate[df_simulate$x_40 <= 0 & df_simulate$x_150 >= 0.5, "y"] <-
  df_simulate[df_simulate$x_40 <= 0 & df_simulate$x_150 >= 0.5, "y"] - 10

### Function to calculate the sum of variances
n_min <- 5
index <- n_min:(1000 - n_min)

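sum_var <- function(m) {
  # Relies on two globals: df_simulate (the data) and index (the split points)
  df1 <- df_simulate[, m]
  # Column 1 of df2 holds the sorted values of column m
  df2 <- as.data.frame(sort(df1))
  # For each split point i, store sd(lower part) + sd(upper part) in column 2
  for (i in index) {
    df3 <- df2[1:i, 1]
    df4 <- df2[(i + 1):1000, 1]
    df2[i, 2] <- sd(df3) + sd(df4)
  }
  # Return the sorted value at the split with the smallest combined sd
  position <- which.min(df2[, 2])
  return(df2[position, 1])
}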

### Parallel Computing    
library("parallel")
cl <- makeCluster(6)
res_par <- clusterApply(cl, 1:10000, fun = sum_var)

When you use makeCluster on Windows, a new R process is started for every worker in the cluster. In these processes only the base packages are loaded, and they don't contain the variables you defined in your global environment. Therefore, you need to export all the variables your function uses to the workers. For this, you can use clusterExport:

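library("parallel")
cl <- makeCluster(6)
# sum_var reads two globals, df_simulate and index, so export both
clusterExport(cl, c("df_simulate", "index"))
res_par <- clusterApply(cl, 1:10000, fun = sum_var)
stopCluster(cl)  # shut down the worker processes when done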
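To see the effect on a small scale before running all 10000 columns, here is a minimal sketch with toy data. The names toy_data, toy_rows and col_total are hypothetical stand-ins for df_simulate, index and sum_var and are not part of the question; two workers are assumed to be enough for a demonstration.

```r
library(parallel)

# Hypothetical stand-ins for df_simulate and index (not from the question)
toy_data <- data.frame(a = c(3, 1, 2), b = c(6, 5, 4))
toy_rows <- 1:2

# Stand-in for sum_var: toy_data and toy_rows are free variables that
# each worker looks up in its own global environment
col_total <- function(m) {
  sum(toy_data[toy_rows, m])
}

cl <- makeCluster(2)

# Without an export, a call run at top level is expected to fail with
# "object 'toy_data' not found"; tryCatch keeps the message instead of stopping
res_fail <- tryCatch(
  clusterApply(cl, 1:2, fun = col_total),
  error = function(e) conditionMessage(e)
)

# Copy every global the function needs to the workers, then retry.
# envir = environment() is the global environment when run at top level.
clusterExport(cl, c("toy_data", "toy_rows"), envir = environment())
res_ok <- clusterApply(cl, 1:2, fun = col_total)

stopCluster(cl)
```

Once the toy version works, the same pattern should carry over to the real function: list every global that the function body reads in the character vector passed to clusterExport.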

Here is a small overview and introduction to different parallelisation techniques in R.
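As one example of such an alternative, the sketch below uses parLapply from the same parallel package and passes the data as an explicit argument instead of exporting globals. The names toy_data and column_range are hypothetical and only serve as illustration; this is not the approach from the answer above, just a variation under those assumptions.

```r
library(parallel)

# Hypothetical toy data (not from the question)
toy_data <- data.frame(a = c(3, 1, 2), b = c(6, 5, 4))

# Illustrative helper: every input arrives as an argument, so the workers
# do not need anything from the global environment
column_range <- function(m, data) {
  max(data[, m]) - min(data[, m])
}

cl <- makeCluster(2)

# Extra named arguments after the function are forwarded to each call
res <- parLapply(cl, seq_along(toy_data), column_range, data = toy_data)

stopCluster(cl)
```

By default parLapply splits the input into roughly one chunk per worker, which usually means less communication overhead than dispatching thousands of individual tasks with clusterApply. The trade-off is that the data argument is sent along with the work, so for a large data frame exporting it once with clusterExport may still be the cheaper option.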
