简体   繁体   中英

Faster way to make this ifelse loop in r

I have a largish data frame in R [300000, 45]. I want to add a column (or create a vector) of TRUE/FALSE where a TRUE is assigned if the value of another column is different than the value above (i-1) and FALSE if they are the same. The basic R code would be:

etS$ar1TF <- NA
mode(etS$ar1TF) <- 'logical'
etS$ar1TF[1] <- TRUE
for(i in 2:length(etS$ar1TF)) {
  if(etS$siteYear[i] == etS$siteYear[i-1]) {
    etS$ar1TF[i] <- FALSE
  } else {
    etS$ar1TF[i] <- TRUE
  }
}

However, this will be incredibly slow and inefficient. Are there better ways to use existing functions or vectorization to do this quickly and efficiently? I'm not sure if a while() statement would be any more efficient. I suppose I could start by assigning everything as TRUE then using the if statement within a for loop and removing the else statement but this really isn't much better. I'm not sure if the apply function would be faster or more efficient in this case because the size and type are already assigned.

Make use of vectorization. Something like below will do the trick:

ar1TF <- logical(length(siteYear))
ar1TF[-1] <- (siteYear[-1] != siteYear[-length(siteYear)])
ar1TF[1] <- NA

etS$ar1TF <- ar1TF # to add the column to the data.frame

EDIT : It seems that the diff solution may be a bit faster:

x <- sample(1:3, 100000, replace=TRUE)
library('microbenchmark')
microbenchmark({
   y1 <- logical(length(x))
   y1[-1] <- (x[-1] != x[-length(x)])
   y1[1] <- NA
},{
   y2 <- diff(x)
   y2 <- c(NA, y2 != 0)
})

## Unit: microseconds
## expr        min       lq    median       uq      max neval
## [!=]   1062.651 1070.690 1088.1935 1169.500 2367.582   100
## [diff]  811.121  821.443  844.3575  892.967 2244.022   100

You could use diff to perform the differencing:

vec = sample(1:10, 100, replace = TRUE)
diff(vec) == 0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[73] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE

The standard setting of diff uses a lag of 1, which is what you need. To add it to your data.frame , you need to append an NA :

df$new_col = c(NA, diff(vec) == 0)

Some basic timings show that this is quite fast, also for larger vectors:

> system.time(dum <- diff(sample(1:10, 10e3, replace = TRUE)) == 0)
   user  system elapsed 
  0.001   0.000   0.001 
> system.time(dum <- diff(sample(1:10, 10e5, replace = TRUE)) == 0)
   user  system elapsed 
  0.189   0.012   0.202 
> system.time(dum <- diff(sample(1:10, 10e7, replace = TRUE)) == 0)
   user  system elapsed 
  6.810   1.908  10.376 

So, with your datasize the processing time should be less than a second. Note that these times include creating the test dataset, so the actually differencing is almost twice as fast.

Performing a direct comparison with a for loop based solution shows the difference in speed:

diff_for_loop = function(vec) {
    result_vec = vec
    for(i in seq_along(vec)[-1]) {
      if(vec[i] == vec[i-1]) {
        result_vec <- FALSE
      } else {
        result_vec <- TRUE
      }
    }
    return(result_vec)
}
vec = sample(1:10, 10e5, replace = TRUE)
system.time(dum_for_loop <- diff_for_loop(vec))
#   user  system elapsed 
#  1.220   0.008   1.232 
system.time(dum_diff <- diff(vec) == 0)
#   user  system elapsed 
#  0.051   0.005   0.056 

Which makes the diff based solution 22 times faster.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM