
Replacing a for-loop which iteratively runs Backward Stepwise Regressions on a list of dfs in R with an apply function to reduce compute time

The R scripts, and a reduced version of the file folder with the multiple csv file-formatted datasets in it, can all be found in my GitHub Repository at this Link.

In my script called 'LASSO code', after loading a file folder full of N csv file-formatted datasets into R and assigning them all to a list called 'datasets', I ran the following code to fit N LASSO Regressions, one to each of the datasets:

library(dplyr)        # for select() and starts_with()
library(elasticnet)   # for enet()

set.seed(11)     # to ensure replicability
LASSO_fits <- lapply(dfs, function(i) 
               enet(x = as.matrix(select(i, starts_with("X"))), 
                    y = i$Y, lambda = 0, normalize = FALSE))

Now, I would like to replicate this same process for a Backward Elimination Stepwise Regression (we'll keep it simple by just using the step() function from the stats library), using another apply function rather than having to use a loop. The problem is this: the only way I know how to do this is by initializing or prepping it before running it, by first establishing:

set.seed(100)      # for reproducibility
full_fits <- vector("list", length = length(dfs))
Backward_Stepwise_fits <- vector("list", length = length(dfs))

And only then fitting all of the Backward_Stepwise_fits. But I cannot figure out how to put both full_fits and Backward_Stepwise_fits into the same apply function; the only way I can think of would be to use a for loop and stack them on top of each other inside of it, but that would be very computationally inefficient. And the number of datasets N I will be running both of these on is 260,000!

I wrote a for-loop that does in fact run, but it took over 12 hours to finish running on just 58,500 datasets, which is unacceptably slow. The code I used for it is the following:

set.seed(100)      # for reproducibility
for(i in seq_along(dfs)) {
  full_fits[[i]] <- lm(formula = Y ~ ., data = dfs[[i]])
  Backward_Stepwise_fits[[i]] <- step(object = full_fits[[i]], 
                        scope = formula(full_fits[[i]]),
                        direction = 'backward', trace = 0) }

I have tried the following, but get the corresponding error message in the Console:

> full_model_fits <- lapply(datasets, function(i)
+   lm(formula = Y ~ ., data = datasets))
Error in terms.formula(formula, data = data) : 
duplicated name 'X1' in data frame using '.'

Ever thought about parallelizing the whole thing?

First, you could define the code more succinctly.

system.time(
  res <- lapply(lst, \(X) {
    full <- lm(Y ~ ., X)
    back <- step(full, scope=formula(full), dir='back', trace=FALSE)
  })
)
#  user  system elapsed 
# 3.895   0.008   3.897 

system.time(
  res1 <- lapply(lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
)
#  user  system elapsed 
# 3.820   0.016   3.833 

stopifnot(all.equal(res, res1))

The results are equal, but there is no real time difference.
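
If you also want to keep the full fits alongside the backward-eliminated ones (your full_fits and Backward_Stepwise_fits), the anonymous function can simply return both in a named list instead of filling two pre-allocated lists. A minimal sketch (res_both is just an illustrative name):

res_both <- lapply(lst, \(X) {
  full <- lm(Y ~ ., X)                      # full model, your full_fits[[i]]
  back <- step(full, scope=formula(full),   # backward elimination,
               dir='back', trace=FALSE)     # your Backward_Stepwise_fits[[i]]
  list(full=full, backward=back)            # keep both per dataset
})

res_both[[1]]$full and res_both[[1]]$backward then hold the two fits for the first dataset.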

Now, using parallel::parLapply.

library(parallel)

CL <- makeCluster(detectCores() - 1L)
clusterExport(CL, c('lst'))

system.time(
  res2 <- parLapply(CL, lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
)
#  user  system elapsed 
# 0.075   0.032   0.861 

stopCluster(CL)

stopifnot(all.equal(res, res2))

On this machine that is about 4.5 times faster.
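
As an aside, on Unix-alikes (Linux/macOS) the fork-based parallel::mclapply avoids the explicit cluster setup. A sketch of the same call (not timed here, and it will not run with more than one core on Windows):

res3 <- mclapply(lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE),
                 mc.cores=detectCores() - 1L)   # forks workers instead of a PSOCK cluster

stopifnot(all.equal(res, res3))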

Your error duplicated name 'X1' in data frame using '.' means that in some of your datasets there are two columns named "X1". Here's how to find them:

names(lst$dat6)[9] <- 'X1'  ## producing duplicated column X1 for demo 

sapply(lst, \(x) anyDuplicated(names(x)))
# dat1  dat2  dat3  dat4  dat5  dat6  dat7  dat8  dat9 dat10 dat11 
# 0     0     0     0     0     9     0     0     0     0     0 
# ...

The result shows that in dataset dat6 the 9th column is the (first) duplicate. All others are clean.
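
Once found, one way to repair such a dataset before fitting is to deduplicate its names with make.unique(); whether renaming or dropping the extra column is the right fix depends on your data, so take this just as a sketch:

names(lst$dat6) <- make.unique(names(lst$dat6))   # the second "X1" becomes "X1.1"
anyDuplicated(names(lst$dat6))
# 0

After that, lm(Y ~ ., data=lst$dat6) no longer complains about the duplicated name.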


Data:

n <- 50
lst <- replicate(n, {
  dat <- data.frame(matrix(rnorm(500*30), 500, 30))
  cbind(Y=rowSums(as.matrix(dat)%*%rnorm(ncol(dat))) + rnorm(nrow(dat)), dat)
}, simplify=FALSE) |> 
  setNames(paste0('dat', seq_len(n)))
