
Optimizing For loop with nested if in R

I am trying to merge multiple csv files into a single dataframe and then manipulate the resulting dataframe with a for loop. The resulting dataframe may have anywhere between 1,500,000 and 2,000,000 rows.

I am using the code below for this.

setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path) {
  files = dir(path, pattern = '\\.csv', full.names = TRUE)
  tables = lapply(files, read_csv)
  do.call(rbind, tables)
}


Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)

for (i in 1:nrow(Data1))
{ 
  Data1$Category[i] = ""
  Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
                       ifelse(Data1$Days[i] <= 60, "31-60",
                       ifelse(Data1$Days[i] <= 90, "61-90",">90")))     

}

However, the code runs for a very long time. Is there a better and faster way of doing the same operation?

We can make this more optimized by reading with fread from data.table and then using cut/findInterval instead of the loop. The difference becomes more pronounced on machines with multiple cores, since fread can use them to read the files in parallel.

library(data.table)

merge_data <- function(path) {
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  rbindlist(lapply(files, fread, select = c(8, 9, 17)))
}

Data <- merge_data("D:/Projects")
Data[, Category := cut(Days, breaks = c(-Inf, 30, 60, 90, Inf),
                       labels = c("<=30", "31-60", "61-90", ">90"))]
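For reference, findInterval (mentioned above as an alternative) produces the same bucketing as cut. A minimal sketch on made-up Days values: findInterval returns 0 for values below the first break, so adding 1 turns it into an index into the label vector.

```r
Days <- c(5, 30, 31, 60, 61, 90, 91)
labels <- c("<=30", "31-60", "61-90", ">90")

# breaks are the first value of each upper bucket: 31, 61, 91
Category <- labels[findInterval(Days, c(31, 61, 91)) + 1]
```

On the data.table above this would be `Data[, Category := labels[findInterval(Days, c(31, 61, 91)) + 1]]`.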

You're already using dplyr , so why not just:

Data = merge_data("D:/Projects") %>%
  select(8, 9, 17) %>%
  mutate(Category = cut(Days,
                        breaks = c(-Inf, 30, 60, 90, Inf), 
                        labels = c("<=30", "31-60", "61-90", ">90")))
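If you find cut's paired breaks/labels hard to read, dplyr's case_when expresses the same thresholds explicitly. A small sketch on made-up data (the tibble here is just for illustration):

```r
library(dplyr)

df <- tibble(Days = c(10, 45, 75, 120))
df <- df %>%
  mutate(Category = case_when(
    Days <= 30 ~ "<=30",   # conditions are checked in order,
    Days <= 60 ~ "31-60",  # so each row gets the first match
    Days <= 90 ~ "61-90",
    TRUE       ~ ">90"     # fallback for everything else
  ))
```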

Akrun is indeed right that fread is substantially faster than read.csv.

However, in addition to his post, I would also add that your for loop is totally unnecessary. He replaced it with cut/findInterval, which I am not familiar with. In terms of simple R programming though, for loops are necessary when some factor in your calculation changes by row. In your code this is not the case, so there is no need for a for loop.

Essentially you are running the calculation up to 2 million times, when you only need to run it over the whole column once.

You can replace your for loop with something like this:

Data1$category = ifelse(Data1$Days <= 30, "<=30",
                 ifelse(Data1$Days <= 60, "31-60",
                 ifelse(Data1$Days <= 90, "61-90",">90")))

and your code will run waaaaaay faster.
