Optimizing R code to cbind rows to a data frame

I have two data frames. For every row in the first data frame (df), there are three corresponding rows in the second data frame (design). The code needs to take every row in df, match it to its three corresponding rows in design, and append those three rows to a new data frame, along with some other variables that are needed.

The code I have thus far is:

df1 <- NULL
for(i in 1:nrow(df)){
  x <- c(design[which(df$version[i] == design$version & df$task[i] == design$task) , ])
  for(j in seq_along(x[[3]])){
    set <- NULL
    set <- cbind(t1 = 0,
                 t2 = 0,
                 t3 = 0,
                 t4 = 0,
                 resp_id = df$resp_id[i],
                 block = df$version[i],
                 task = x$task[j],
                 concept = x$concept[j],
                 brand = x$brand[j],
                 branch_type = x$type_of_branch[j], 
                 branch_prox = x$branch_prox[j], 
                 atm_prox = x$atm_prox[j],
                 atm_location_phys = x$atm_location_phys[j],
                 atm_fees = x$atm_fees[j],
                 service = x$service[j],
                 monthly_charge = x$monthly_charge[j],
                 checking_w_interest = x$checking_w_interest[j],
                 overdraft_prot = x$overdraft_prot[j],
                 benefits = x$benefits[j],
                 none = 0,
                 pick = ifelse(df$dc1[i] == x[[3]][[j]], 1, 0))
    df1 <- data.frame(rbind(df1, set))
  }
}

As you can probably tell just by looking at it, this code is extremely slow, and I need to cut its run time significantly.

There are 55,000+ observations in the first data frame, so I have been working with data tables instead (for speed) and trying to use lapply to iterate through each element of the list x (x is 3 lists long, with 15 elements in each list). The code I have for this is:

df1 <- data.table(t1 = numeric(),
                  t2 = numeric(),
                  t3 = numeric(),
                  t4 = numeric(),
                  resp_id = numeric(),
                  block = numeric(),
                  task = numeric(),
                  concept = numeric(),
                  brand = numeric(),
                  branch_type = numeric(),
                  branch_prox = numeric(),
                  atm_prox = numeric(),
                  atm_location_phys = numeric(),
                  atm_location_digi = numeric(),
                  atm_fees = numeric(),
                  service = numeric(),
                  monthly_charge = numeric(),
                  checking_w_interest = numeric(),
                  overdraft_prot = numeric(),
                  benefits = numeric(),
                  none = numeric(),
                  pick = numeric())
df2 <- data.table()

for(i in 1:nrow(df)){
    set <- NULL
    x <- data.table(design[which(df$version[i] == design$version & df$task[i] == design$task) , ])

    set <- list(x[1], x[2], x[3])
    df1 <- data.table(do.call(rbind, lapply(1:3, function(y){
        set.temp <- list(t1 = 0,
                         t2 = 0,
                         t3 = 0,
                         t4 = 0,
                         resp_id = df$resp_id[i],
                         block = df$version[i],
                         task = set[[y]]$task,
                         concept = set[[y]]$concept,
                         brand = set[[y]]$brand,
                         branch_type = set[[y]]$type_of_branch,
                         branch_prox = set[[y]]$branch_prox,
                         atm_prox = set[[y]]$atm_prox,
                         atm_location_phys = set[[y]]$atm_location_phys,
                         atm_location_digi = set[[y]]$atm_location_digi,
                         atm_fees = set[[y]]$atm_fees,
                         service = set[[y]]$service,
                         monthly_charge = set[[y]]$monthly_charge,
                         checking_w_interest = set[[y]]$checking_w_interest,
                         overdraft_prot = set[[y]]$overdraft_prot,
                         benefits = set[[y]]$benefits,
                         none = 0,
                         pick = ifelse(df$dc1[i] == set[[y]]$concept, 1, 0)) })))
    df2 <- rbind(df2, df1)
}

The first version took over an hour to run. The second is still running but will probably take around 45 minutes.

If you can weigh in and provide some pointers as to where I can speed up my code, I would greatly appreciate it.

How about the code below?

rbindlist(lapply(1:nrow(df), function(i) {
  x <- setDT(design[which(df$version[i] == design$version & df$task[i] == design$task), ])
  set <- list(x[1], x[2], x[3])
  df1 <- rbindlist(lapply(1:3, function(y){
    data.table(
      t1 = 0,
      t2 = 0,
      t3 = 0,
      t4 = 0,
      resp_id = df$resp_id[i],
      block = df$version[i],
      task = set[[y]]$task,
      concept = set[[y]]$concept,
      brand = set[[y]]$brand,
      branch_type = set[[y]]$type_of_branch,
      branch_prox = set[[y]]$branch_prox,
      atm_prox = set[[y]]$atm_prox,
      atm_location_phys = set[[y]]$atm_location_phys,
      atm_location_digi = set[[y]]$atm_location_digi,
      atm_fees = set[[y]]$atm_fees,
      service = set[[y]]$service,
      monthly_charge = set[[y]]$monthly_charge,
      checking_w_interest = set[[y]]$checking_w_interest,
      overdraft_prot = set[[y]]$overdraft_prot,
      benefits = set[[y]]$benefits,
      none = 0,
      pick = ifelse(df$dc1[i] == set[[y]]$concept, 1, 0)
    )
  }))
  return(df1)
}))

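rbindlist reduces the processing time: it binds each list of pieces in a single call instead of growing the result with repeated rbind() calls.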
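
If the loop itself is still the bottleneck, one possible alternative is to drop the per-row lookup and join the two tables once on version and task. The sketch below uses base R's merge() on toy data; the values, the reduced column set, and the helper column row_id are assumptions for illustration, not the original data. It also assumes that pick should compare dc1 with concept, as in the data.table versions above.

```r
# Toy stand-ins for the question's data (hypothetical values, fewer columns).
df <- data.frame(
  resp_id = c(101, 102),
  version = c(1, 2),
  task    = c(1, 1),
  dc1     = c(2, 3)   # concept the respondent picked
)
design <- data.frame(
  version = rep(c(1, 2), each = 3),
  task    = 1,
  concept = rep(1:3, times = 2),
  brand   = c(4, 2, 1, 3, 1, 2)
)

# Helper column (an assumption) so the original row order of df can be restored.
df$row_id <- seq_len(nrow(df))

# One join replaces the outer loop: every df row is paired with its
# matching design rows on (version, task).
out <- merge(df, design, by = c("version", "task"))

# Vectorised replacements for the per-row ifelse() and the constant columns.
out$pick <- as.integer(out$dc1 == out$concept)
out$none <- 0
for (nm in c("t1", "t2", "t3", "t4")) out[[nm]] <- 0
names(out)[names(out) == "version"] <- "block"

# Restore respondent order, then concept order within each df row,
# and drop the helper columns.
out <- out[order(out$row_id, out$concept), ]
out$row_id <- NULL
out$dc1 <- NULL
rownames(out) <- NULL

print(out)
```

On this toy input the result has six rows (three per row of df). The same pattern should carry over to data.table, which provides its own merge method, and the remaining design columns come along automatically because merge() keeps every non-key column from both inputs.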
