
Optimizing For loop with nested if in R

I am trying to merge multiple CSV files into a single data frame and then manipulate the resulting data frame with a for loop. The resulting data frame may have anywhere between 1,500,000 and 2,000,000 rows.

I am using the code below.

setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path) 
{ 
  files = dir(path, pattern = '\\.csv', full.names = TRUE)
  tables = lapply(files, read_csv)
  do.call(rbind, tables)
}


Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)

for (i in 1:nrow(Data1))
{ 
  Data1$Category[i] = ""
  Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
                       ifelse(Data1$Days[i] <= 60, "31-60",
                       ifelse(Data1$Days[i] <= 90, "61-90",">90")))     

}

However the code is running for very long. Is there a better and faster way of doing the same operation?

We can optimize this by reading the files with fread from data.table and then using cut/findInterval instead of the row-by-row loop. The gain is even more pronounced on a multi-core machine or a server with multiple nodes, where fread can use all available threads and read in parallel.

library(data.table)
merge_data <- function(path) {
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  rbindlist(lapply(files, fread, select = c(8, 9, 17)))
}

Data <- merge_data("D:/Projects")
Data[, Category := cut(Days, breaks = c(-Inf, 30, 60, 90, Inf),
                       labels = c("<=30", "31-60", "61-90", ">90"))]
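
For reference, here is a minimal sketch of the findInterval variant mentioned above (my addition, not part of the original answer), assuming the selected columns include Days:

# findInterval() maps each Days value to the index of the interval it falls in:
# 0 for Days <= 30, 1 for 31-60, 2 for 61-90, 3 for > 90; add 1 to index the labels.
labs <- c("<=30", "31-60", "61-90", ">90")
Data[, Category := labs[findInterval(Days, c(31, 61, 91)) + 1]]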

You're already using dplyr, so why not just:

Data = merge_data("D:/Projects") %>%
  select(8, 9, 17) %>%
  mutate(Category = cut(Days,
                        breaks = c(-Inf, 30, 60, 90, Inf),
                        labels = c("<=30", "31-60", "61-90", ">90")))
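
If you prefer to stay entirely in the tidyverse for the merge step as well, a minimal sketch (not from the original answer; the helper name merge_data_tidy is illustrative) could replace do.call(rbind, ...) with dplyr::bind_rows:

library(dplyr)
library(readr)

# Sketch: read every .csv under `path` and stack the results with bind_rows(),
# which avoids the repeated copying that do.call(rbind, ...) can incur.
merge_data_tidy <- function(path) {
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  bind_rows(lapply(files, read_csv))
}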

Akrun is indeed right that fread is substantially faster than read.csv.

In addition to his post, I would add that your for loop is completely unnecessary. He replaced it with cut/findInterval, which I am not familiar with. In plain R terms, though, a for loop is only needed when something in the calculation changes from row to row. That is not the case in your code, so there is no need for a loop.

Essentially you are running a calculation up to 2 million times when you only need to run it once over the whole column.

You can replace your for loop with something like this:

Data1$Category = ifelse(Data1$Days <= 30, "<=30",
                 ifelse(Data1$Days <= 60, "31-60",
                 ifelse(Data1$Days <= 90, "61-90", ">90")))

and your code will run waaaaaay faster
