How to transform a “for loop” into a “foreach” loop in R?

I'm dealing with a problem that requires parallel computing to get results faster than with a classic "for loop".

Here's the problem:

I need to fit linear models for 198135 outcome variables, each stored in a data frame within a list object. For every model I have to store the beta and p-value of each predictor, along with the model's goodness-of-fit measures, in a data frame.

I wrote a working "for loop" that does the task properly, but it takes more than 35 hours to finish. I know that R is using less than 20% of my 8-core CPU, and I would like to use all of it. The problem is that I don't know how to transform my for loop into a foreach loop to take advantage of parallel computing.

Here's some example code for my problem on a smaller scale:

library(tidyverse)
library(broom)

## Example data 

outcome_list <- list(as.data.frame(cbind(rnorm(32), dataframe_id = c(1))),
                     as.data.frame(cbind(rnorm(32), dataframe_id =  c(2))),
                     as.data.frame(cbind(rnorm(32), dataframe_id =  c(3)))) ## This represents my list of 198135 dataframes

mtcars <- mtcars #I will use the explanatory variables from here



## Below this line is my current solution with a for loop that works fine

x <- list()
## One placeholder row (removed after the loop); 10 columns for the 10 values stored per model
results_df <- as.data.frame(cbind(dataframe_id = c(0), intercept = c(0),
                                b_mpg = c(0), p_mpg = c(0),
                                b_cyl = c(0), p_cyl = c(0),
                                p.model = c(0), r.squared = c(0),
                                AIC = c(0), BIC = c(0)))

for(i in 1:3){
  x[[i]] <- lm(outcome_list[[i]]$V1 ~ mtcars$mpg + mtcars$cyl)
  gof <- broom::glance(x[[i]])
  betas <- broom::tidy(x[[i]])
  ## The id column is named dataframe_id (there is no V2 column, so $V2 would be NULL)
  results_df <- rbind(results_df, c(outcome_list[[i]]$dataframe_id[1],
                                    betas$estimate[1],
                                    betas$estimate[2], betas$p.value[2],
                                    betas$estimate[3], betas$p.value[3],
                                    gof$p.value, gof$r.squared, gof$AIC,
                                    gof$BIC))

  if(i %% 1 == 0){ # TRUE on every iteration; use a larger modulus to report less often
    message(paste(i, "of 3")) # To know if my machine has not crashed
    x <- list() # To keep RAM clean of useless data
  }
  gc()
}

results_df <- results_df[-1, ]



With the code shown above I get the results that I need (a data frame with regression parameters and goodness-of-fit measures for each outcome variable in the list), but it is very slow because I can't use all of my computer's power.

I know that the "foreach" and "doParallel" packages can solve this problem faster, but I still don't understand the logic behind the structure of foreach loops, since it's the first time I've needed to process so much data.

PS: I've already tried several approaches with the foreach function, but I didn't get anywhere. I haven't included my foreach attempts because I don't understand what I'm doing.

You can do:

## Example data 
outcome_list <- list(as.data.frame(cbind(rnorm(32), dataframe_id = c(1))),
                     as.data.frame(cbind(rnorm(32), dataframe_id = c(2))),
                     as.data.frame(cbind(rnorm(32), dataframe_id = c(3))))

## Parallel code
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
results_list <- foreach(i = 1:3) %dopar% {

  mylm <- lm(outcome_list[[i]]$V1 ~ mtcars$mpg + mtcars$cyl)
  gof <- broom::glance(mylm)
  betas <- broom::tidy(mylm)

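  ## Return one numeric vector per model; the id column is named dataframe_id
  ## (the data frames have no V2 column, so $V2 would be NULL and silently dropped)
  c(outcome_list[[i]]$dataframe_id[1],
    betas$estimate[1],
    betas$estimate[2], betas$p.value[2],
    betas$estimate[3], betas$p.value[3],
    gof$p.value, gof$r.squared, gof$AIC,
    gof$BIC)
}
stopCluster(cl)

## Ten names for the ten values returned by each iteration
results_df <- setNames(as.data.frame(do.call("rbind", results_list)),
                       c("dataframe_id", "intercept", "b_mpg", "p_mpg",
                         "b_cyl", "p_cyl", "p.model", "r.squared", "AIC", "BIC"))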

You return your results from foreach (which works like lapply) instead of growing an object (which is not possible in parallel, by the way).
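
To make the lapply comparison concrete, here is a minimal sequential sketch in base R only. The helper name `fit_one` and the fixed seed are illustrative assumptions, not part of the original code, and it reads the coefficients from `summary()` rather than broom so that it runs without extra packages (the overall model p-value is left out for brevity). Each call returns its own row, and the rows are combined once at the end; the body of `fit_one` plays the same role as the block after `%dopar%`.

```r
# Sequential sketch of the "return a result, don't grow an object" pattern.
set.seed(1)  # assumption: fixed seed so the toy data are reproducible
outcome_list <- lapply(1:3, function(id) {
  data.frame(V1 = rnorm(32), dataframe_id = id)
})

# fit_one() is a hypothetical helper: it fits one model and RETURNS a named
# numeric vector instead of appending to a shared data frame.
fit_one <- function(df) {
  fit <- lm(df$V1 ~ mtcars$mpg + mtcars$cyl)
  s <- summary(fit)
  coefs <- s$coefficients  # rows: intercept, mpg, cyl
  c(dataframe_id = df$dataframe_id[1],
    intercept = coefs[1, "Estimate"],
    b_mpg = coefs[2, "Estimate"], p_mpg = coefs[2, "Pr(>|t|)"],
    b_cyl = coefs[3, "Estimate"], p_cyl = coefs[3, "Pr(>|t|)"],
    r.squared = s$r.squared,
    AIC = AIC(fit), BIC = BIC(fit))
}

# One result per list element, combined in a single step afterwards.
results_list <- lapply(outcome_list, fit_one)
results_df <- as.data.frame(do.call(rbind, results_list))
```

Because no iteration touches a shared object, the iterations are independent of each other, which is what allows them to be distributed across workers.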

Learn more about how to use foreach there.
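
As a possible variation on the combining step in the answer, foreach can bind the rows itself through its `.combine` argument, so no separate `do.call("rbind", ...)` call is needed. The sketch below is only an illustration under some assumptions: the foreach and doParallel packages are installed, two workers are used, the seed is fixed, and coefficients are read from `summary()` instead of broom, with a reduced set of columns.

```r
library(doParallel)  # also attaches foreach and parallel

set.seed(1)  # assumption: fixed seed so the toy data are reproducible
outcome_list <- lapply(1:3, function(id) {
  data.frame(V1 = rnorm(32), dataframe_id = id)
})

cl <- makeCluster(2)  # assumption: 2 workers; adjust to your machine
registerDoParallel(cl)

# Each iteration returns a one-row data frame; .combine = rbind stacks them.
results_df <- foreach(i = seq_along(outcome_list), .combine = rbind) %dopar% {
  fit <- lm(outcome_list[[i]]$V1 ~ mtcars$mpg + mtcars$cyl)
  coefs <- summary(fit)$coefficients  # rows: intercept, mpg, cyl
  data.frame(dataframe_id = outcome_list[[i]]$dataframe_id[1],
             intercept = coefs[1, "Estimate"],
             b_mpg = coefs[2, "Estimate"], p_mpg = coefs[2, "Pr(>|t|)"],
             b_cyl = coefs[3, "Estimate"], p_cyl = coefs[3, "Pr(>|t|)"])
}

stopCluster(cl)
```

Returning named one-row data frames keeps the column names attached to the values, so the result does not depend on a separate list of names staying in sync with the returned vector.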
