Optimizing ifelse + loop in R

Question

I've been trying to optimize a loop in R, but because I'm not an expert, I can't make much progress. I was wondering if you could help me, because it's taking far too much time.

Basically, I have a data frame and a list of data frames, like the ones below:


set.seed(123)

spp.list <- c("spA", "spB")
locations <- c("loc1", "loc2")
depths <- c(0:1)
years <- c(2000, 2001)
months <- c(1,2)

#Dataframe 1 (f.data):
n.rows <- 10
f.data <- data.frame(spp = sample(spp.list, n.rows, replace = T), 
                     location = sample(locations, n.rows, replace = T), 
                     depth = sample(depths, n.rows, replace = T),
                     Y = sample(years, n.rows, replace = T),
                     M = sample(months, n.rows, replace = T)
)




#List of dataframes (loc.list)
loc1 <- data.frame(Y = years,
                   M = months,
                   '0' = c(10,15),
                   '1' = c(0,5)
)
names(loc1)[3:4] <-  c(-0,-1)


loc2 <- data.frame(Y = years,
                   M = months,
                   '0' = c(13,18),
                   '1' = c(3,7)
)
names(loc2)[3:4] <-  c(-0,-1)

loc.list <- list(loc1,loc2)

names(loc.list) <- c('loc1','loc2')

Dataframe 1 (f.data) contains species, location, depth, year and month. The list of dataframes (loc.list) contains one dataframe per location (pretty much the same locations as in f.data). Each dataframe in this list also contains year and month, plus a value for each depth category; each depth is represented by its own column (0 = surface, -1 = 1 m deep).

What I need to do is match the two objects on location, depth, year and month, so that I can copy the value recorded in loc.list into f.data. For instance, the first row of f.data says 'loc2', depth '0', year 2000, and month 1. In loc.list, the value for loc2, year 2000, month 1 and depth 0 (column 0) is 13. Therefore, I copy this value (13) into a new column in f.data (f.data$temp).

> f.data
   spp location depth    Y M
1  spA     loc2     0 2000 1 <<<----
2  spA     loc2     1 2001 2
3  spA     loc2     0 2000 2
4  spB     loc1     0 2001 1
5  spA     loc2     0 2001 1
6  spB     loc1     0 2000 1
7  spB     loc2     1 2000 1
8  spB     loc1     1 2000 2
9  spA     loc1     0 2000 1
10 spA     loc1     1 2001 1

> loc.list
$loc1
     Y M  0 -1
1 2000 1 10  0
2 2001 2 15  5

$loc2
     Y M  0 -1
1 2000 1 13  3  <<<----
2 2001 2 18  7
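
As a minimal sketch of the lookup described above for the first row of f.data (the tables are retyped from the printout rather than regenerated with sample(), and the names row1, tab, hit, col and temp1 are illustrative only):

```r
# Lookup tables as printed above (one data frame per location).
loc.list <- list(
  loc1 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(10, 15), `-1` = c(0, 5), check.names = FALSE),
  loc2 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(13, 18), `-1` = c(3, 7), check.names = FALSE)
)

# First row of f.data as printed above.
row1 <- data.frame(location = "loc2", depth = 0, Y = 2000, M = 1)

tab <- loc.list[[row1$location]]          # table for this row's location
hit <- tab$Y == row1$Y & tab$M == row1$M  # row(s) with matching year and month
col <- as.character(-row1$depth)          # depth 0 -> column "0", depth 1 -> column "-1"

temp1 <- tab[hit, col]
temp1
```

This is the single-row version of what needs to happen for every row of f.data.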

Initially, I wrote a rudimentary, long piece of code that does the job, but it is slow. With n.rows = 100000, for instance, it takes ~18 seconds on my machine.

f.data$temp <- NA

start.time <- Sys.time()

for (i in (1:nrow(f.data))) { 
 
 tryCatch({
   
   for (j in 1:length(loc.list)) { 
     
     for (k in 1:nrow(loc.list[[j]])) { 
       
       for (m in 3:ncol(loc.list[[j]])) {
         
         if (f.data$location[i] == names(loc.list)[j]) {
           
           if (f.data$Y[i] == loc.list[[j]]$Y[k]){ 
             
             if (f.data$M[i] == loc.list[[j]]$M[k]) { 
               
               if (round(f.data$depth[i], digits = 0) == (as.numeric (names(loc.list[[j]])[m])*(-1))) { 
                 
                 f.data$temp[i] <- loc.list[[j]][k,m]
                 
               }
             }
           }
         }
       }
     }
   }
 }, error = function(e){})
}

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

> f.data
  spp location depth    Y M temp
1  spA     loc2     0 2000 1   13
2  spA     loc2     1 2001 2    7
3  spA     loc2     0 2000 2   NA
4  spB     loc1     0 2001 1   NA
5  spA     loc2     0 2001 1   NA
6  spB     loc1     0 2000 1   10
7  spB     loc2     1 2000 1    3
8  spB     loc1     1 2000 2   NA
9  spA     loc1     0 2000 1   10
10 spA     loc1     1 2001 1   NA

I improved the code a bit using ifelse(), though still within a classical loop. With n.rows = 100000, it finishes in less than a second.

f.data2 <- f.data[,-length(f.data)]

res2 <- c()

start.time2 <- Sys.time()

for (i in 1:length(loc.list)) { # to assess each df in the list
  for (j in 1:nrow(loc.list[[i]])) { # to assess each row of each df in the list
    for (m in 3:ncol(loc.list[[i]])) { # to assess each colum of each df in the list
      
      res <-  ifelse(f.data2$location == names(loc.list)[i]  &
                       f.data2$Y == loc.list[[i]]$Y[j]  &
                       f.data2$M == loc.list[[i]]$M[j] &
                       round(f.data2$depth, digits = 0) == (as.numeric (names(loc.list[[i]])[m])*(-1)),
                     loc.list[[i]][j,m], NA
      )
      res2 <- cbind(res2,res)       
          }
  }
}

end.time2 <- Sys.time()
time.taken2 <- end.time2 - start.time2
time.taken2

f.data2 <- cbind(f.data2,res2)       

f.data2$res.final <-  rowMeans(f.data2[,(ncol(f.data2)-ncol(res2)+1):(ncol(f.data2))],
                               na.rm=T)

f.data2 <- f.data2[, -c((ncol(f.data2)-ncol(res2)):(ncol(f.data2)-1)) ]


f.data2
f.data

sum(!(f.data$temp == f.data2$res.final), na.rm=T)

But because in reality I have an f.data with 88062 rows and a loc.list with 58 dfs that vary a lot in size (81–479 rows by 9–375 columns), my 'optimized' code still takes forever. I'd appreciate it a lot if anyone could give some insight into how to make this faster. Thanks. L

Answer 1

This sounds like a join, which can be done quite fast without a loop. Here, I combine the list of data frames into one data frame with a location column holding the name of each original table. Then we join (here with dplyr::left_join, but base merge, data.table, duckdb, or collapse could be used for more speed if needed).

library(tidyverse) 
loc.list.df <- do.call(rbind.data.frame, loc.list) %>%
  rownames_to_column() %>%
  separate(rowname, c("location", "row"))

f.data %>%
  left_join(loc.list.df)

Result

Joining, by = c("location", "Y", "M")
   spp location depth    Y M  row  0 -1
1  spA     loc2     0 2000 1    1 13  3
2  spA     loc2     1 2001 2    2 18  7
3  spA     loc2     0 2000 2 <NA> NA NA
4  spB     loc1     0 2001 1 <NA> NA NA
5  spA     loc2     0 2001 1 <NA> NA NA
6  spB     loc1     0 2000 1    1 10  0
7  spB     loc2     1 2000 1    1 13  3
8  spB     loc1     1 2000 2 <NA> NA NA
9  spA     loc1     0 2000 1    1 10  0
10 spA     loc1     1 2001 1 <NA> NA NA
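
The joined table above is still wide: both depth columns are carried along, and the depth recorded in f.data has not been used yet. One possible way to finish the lookup, sketched here as an assumption rather than as part of the answer, is to pick the matching depth column per row with matrix indexing. The data are retyped from the printouts in the question; bind_rows is used instead of rbind because it fills missing columns with NA, which may matter when the tables have different sets of depth columns. The names joined, depth.cols, vals and col.idx are illustrative.

```r
library(dplyr)

# Data retyped from the printouts in the question.
f.data <- data.frame(
  spp      = c("spA", "spA", "spA", "spB", "spA", "spB", "spB", "spB", "spA", "spA"),
  location = c("loc2", "loc2", "loc2", "loc1", "loc2", "loc1", "loc2", "loc1", "loc1", "loc1"),
  depth    = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1),
  Y        = c(2000, 2001, 2000, 2001, 2001, 2000, 2000, 2000, 2000, 2001),
  M        = c(1, 2, 2, 1, 1, 1, 1, 2, 1, 1)
)
loc.list <- list(
  loc1 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(10, 15), `-1` = c(0, 5), check.names = FALSE),
  loc2 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(13, 18), `-1` = c(3, 7), check.names = FALSE)
)

# Stack the tables; the list names become the location column.
loc.list.df <- bind_rows(loc.list, .id = "location")
joined <- left_join(f.data, loc.list.df, by = c("location", "Y", "M"))

# For each row, find the depth column whose name matches -depth,
# then pull one value per row with a (row, column) index matrix.
depth.cols <- setdiff(names(loc.list.df), c("location", "Y", "M"))
vals <- as.matrix(joined[depth.cols])
col.idx <- match(as.character(-round(joined$depth)), depth.cols)
joined$temp <- vals[cbind(seq_len(nrow(joined)), col.idx)]

joined[c("spp", "location", "depth", "Y", "M", "temp")]
```

With many depth columns the intermediate matrix can get large, so reshaping the lookup table to long format before joining may be the lighter option.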

Answer 2

Thanks, @Limey and @Jon, for your comments. They helped a lot. First I followed Limey's suggestion and bound the list of dataframes into a single one. (More elegant code is probably available):

library(dplyr)

loc.list.merged <- list()

for (j in seq_along(loc.list)) {

  loc.list1 <- loc.list[[j]]
  x.loc.list <- list()

  # One long-format block per depth column (columns 3 onward)
  for (i in 3:ncol(loc.list1)) {
    x <- data.frame(temp = loc.list1[, i])
    x$depth <- names(loc.list1)[i]
    x$Y <- loc.list1$Y
    x$M <- loc.list1$M
    x$locality <- names(loc.list)[j]
    x.loc.list[[i - 2]] <- x
  }

  # Stack the per-depth blocks for this location
  loc.list.merged[[j]] <- Reduce(full_join, x.loc.list)
}

# Stack all locations, then turn the depth labels ("0", "-1") into positive depths
loc.list.merged2 <- Reduce(full_join, loc.list.merged)
loc.list.merged2$depth <- as.numeric(loc.list.merged2$depth) * (-1)
names(loc.list.merged2)[5] <- 'location'

> loc.list.merged2
  temp depth    Y M location
1   10     0 2000 1     loc1
2   15     0 2001 2     loc1
3    0     1 2000 1     loc1
4    5     1 2001 2     loc1
5   13     0 2000 1     loc2
6   18     0 2001 2     loc2
7    3     1 2000 1     loc2
8    7     1 2001 2     loc2
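
A more compact way to build the same long table is sketched below, assuming the tidyr package is available alongside dplyr (it is not used in this answer). The names loc.long and f.data.temp are illustrative, and the data are retyped from the printouts in the question.

```r
library(dplyr)
library(tidyr)

# Data retyped from the printouts in the question.
f.data <- data.frame(
  spp      = c("spA", "spA", "spA", "spB", "spA", "spB", "spB", "spB", "spA", "spA"),
  location = c("loc2", "loc2", "loc2", "loc1", "loc2", "loc1", "loc2", "loc1", "loc1", "loc1"),
  depth    = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1),
  Y        = c(2000, 2001, 2000, 2001, 2001, 2000, 2000, 2000, 2000, 2001),
  M        = c(1, 2, 2, 1, 1, 1, 1, 2, 1, 1)
)
loc.list <- list(
  loc1 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(10, 15), `-1` = c(0, 5), check.names = FALSE),
  loc2 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(13, 18), `-1` = c(3, 7), check.names = FALSE)
)

loc.long <- bind_rows(loc.list, .id = "location") %>%   # list names -> location column
  pivot_longer(cols = -c(location, Y, M),               # every depth column
               names_to = "depth", values_to = "temp",
               values_drop_na = TRUE) %>%               # drop depths a table lacks
  mutate(depth = abs(as.numeric(depth)))                # "-1" -> 1, "0" -> 0

# One join on all four keys then fills in temp directly.
f.data.temp <- left_join(f.data, loc.long,
                         by = c("location", "depth", "Y", "M"))
f.data.temp
```

abs() is used instead of multiplying by -1 only to avoid producing a negative zero for the surface depth; for non-positive column labels the two are otherwise equivalent.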

Then I applied Jon's suggestion:


library(dplyr)
# f.data3 is not defined in this post; presumably it is a copy of f.data without the temp column
f.data4 <- 
  f.data3 %>% left_join(loc.list.merged2)

> f.data4
   spp location depth    Y M temp
1  spA     loc2     0 2000 1   13
2  spA     loc2     1 2001 2    7
3  spA     loc2     0 2000 2   NA
4  spB     loc1     0 2001 1   NA
5  spA     loc2     0 2001 1   NA
6  spB     loc1     0 2000 1   10
7  spB     loc2     1 2000 1    3
8  spB     loc1     1 2000 2   NA
9  spA     loc1     0 2000 1   10
10 spA     loc1     1 2001 1   NA

It seems to work. I'll try it tomorrow on my actual dataset.
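
For reference, the same lookup can also be sketched in base R without any package, by reshaping each table to long format and matching on a pasted key. This is an illustrative alternative and not code from either answer; the names long and key are hypothetical, and the data are retyped from the printouts in the question.

```r
# Data retyped from the printouts in the question.
f.data <- data.frame(
  spp      = c("spA", "spA", "spA", "spB", "spA", "spB", "spB", "spB", "spA", "spA"),
  location = c("loc2", "loc2", "loc2", "loc1", "loc2", "loc1", "loc2", "loc1", "loc1", "loc1"),
  depth    = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1),
  Y        = c(2000, 2001, 2000, 2001, 2001, 2000, 2000, 2000, 2000, 2001),
  M        = c(1, 2, 2, 1, 1, 1, 1, 2, 1, 1)
)
loc.list <- list(
  loc1 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(10, 15), `-1` = c(0, 5), check.names = FALSE),
  loc2 = data.frame(Y = c(2000, 2001), M = c(1, 2),
                    `0` = c(13, 18), `-1` = c(3, 7), check.names = FALSE)
)

# Reshape every table to long format: one row per (location, Y, M, depth).
long <- do.call(rbind, lapply(names(loc.list), function(loc) {
  tab <- loc.list[[loc]]
  depth.cols <- setdiff(names(tab), c("Y", "M"))
  data.frame(
    location = loc,
    Y     = rep(tab$Y, times = length(depth.cols)),
    M     = rep(tab$M, times = length(depth.cols)),
    depth = rep(abs(as.numeric(depth.cols)), each = nrow(tab)),
    temp  = unlist(tab[depth.cols], use.names = FALSE)  # columns stacked in order
  )
}))

# Build one string key per row and look every f.data row up in a single match().
key <- function(d) paste(d$location, round(d$depth), d$Y, d$M, sep = "|")
f.data$temp <- long$temp[match(key(f.data), key(long))]
f.data
```

match() returns NA for keys that are absent from the lookup table, so unmatched rows get NA just as in the left join.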

2 solutions

Solution 1
2 2022-09-19 17:26:31

Solution 2
0 2022-09-19 18:03:21
