Faster way to make this ifelse loop in R

I have a largish data frame in R [300000, 45]. I want to add a column (or create a vector) of TRUE/FALSE values, where TRUE is assigned if the value in another column differs from the value in the row above (i-1) and FALSE if they are the same. The basic R code would be:

etS$ar1TF <- NA
mode(etS$ar1TF) <- 'logical'
etS$ar1TF[1] <- TRUE
for(i in 2:length(etS$ar1TF)) {
  if(etS$siteYear[i] == etS$siteYear[i-1]) {
    etS$ar1TF[i] <- FALSE
  } else {
    etS$ar1TF[i] <- TRUE
  }
}

However, this will be incredibly slow and inefficient. Are there better ways to use existing functions or vectorization to do this quickly and efficiently? I'm not sure if a while() statement would be any more efficient. I suppose I could start by assigning everything as TRUE, then use an if statement within a for loop and drop the else branch, but this really isn't much better. I'm not sure if the apply function would be faster or more efficient in this case because the size and type are already assigned.

Make use of vectorization. Something like the following will do the trick:

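siteYear <- etS$siteYear  # the column being compared
ar1TF <- logical(length(siteYear))
# TRUE where an element differs from the one before it
ar1TF[-1] <- (siteYear[-1] != siteYear[-length(siteYear)])
ar1TF[1] <- NA  # the first element has no predecessor

etS$ar1TF <- ar1TF # to add the column to the data.frame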
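
As a small illustration of the shifted comparison, here is a sketch on toy data. The values in siteYear are made up, and the first element is set to TRUE (as in the question's loop) rather than NA. The comparison uses != directly, so it also works when the column holds character values:

```r
# Hypothetical toy data standing in for etS$siteYear
siteYear <- c("A_2001", "A_2001", "A_2002", "B_2002", "B_2002")

# Compare every element with its predecessor in one vectorized step;
# the first element has no predecessor, so it is set to TRUE by hand
changed <- c(TRUE, siteYear[-1] != siteYear[-length(siteYear)])
print(changed)
# [1]  TRUE FALSE  TRUE  TRUE FALSE
```

Dropping the first element (siteYear[-1]) and the last element (siteYear[-length(siteYear)]) lines up each value with the one before it, which is what the loop does one row at a time.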

EDIT: It seems that the diff solution may be a bit faster:

x <- sample(1:3, 100000, replace=TRUE)
library('microbenchmark')
microbenchmark({
   y1 <- logical(length(x))
   y1[-1] <- (x[-1] != x[-length(x)])
   y1[1] <- NA
},{
   y2 <- diff(x)
   y2 <- c(NA, y2 != 0)
})

## Unit: microseconds
## expr        min       lq    median       uq      max neval
## [!=]   1062.651 1070.690 1088.1935 1169.500 2367.582   100
## [diff]  811.121  821.443  844.3575  892.967 2244.022   100

You could use diff to perform the differencing:

vec = sample(1:10, 100, replace = TRUE)
diff(vec) == 0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[73] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE

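By default, diff uses a lag of 1, which is what you need. Note that diff(vec) == 0 is TRUE where consecutive values are equal; use diff(vec) != 0 if TRUE should mark a change, as in the question. To add the result to your data.frame, you need to prepend an NA, because the first element has no predecessor and the result of diff is one element shorter than the input: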

df$new_col = c(NA, diff(vec) == 0)
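
One caveat: diff subtracts neighbouring values, so it needs numeric input. If the column is stored as character, one possible workaround (a sketch, not part of the original answer) is to map the values to integer codes first. The site_year values and variable names below are hypothetical:

```r
# Hypothetical character column
site_year <- c("A_2001", "A_2001", "A_2002", "B_2002", "B_2002")

# Map each distinct value to an integer code so that diff() can subtract
codes <- as.integer(factor(site_year))

# TRUE where the value changes; NA for the first row (no predecessor)
changed <- c(NA, diff(codes) != 0)
print(changed)
# [1]    NA FALSE  TRUE  TRUE FALSE
```

Comparing the shifted vectors directly with != avoids the conversion step, so it may be the simpler choice for non-numeric columns.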

Some basic timings show that this is quite fast, even for larger vectors:

> system.time(dum <- diff(sample(1:10, 10e3, replace = TRUE)) == 0)
   user  system elapsed 
  0.001   0.000   0.001 
> system.time(dum <- diff(sample(1:10, 10e5, replace = TRUE)) == 0)
   user  system elapsed 
  0.189   0.012   0.202 
> system.time(dum <- diff(sample(1:10, 10e7, replace = TRUE)) == 0)
   user  system elapsed 
  6.810   1.908  10.376 

So, with your data size, the processing time should be less than a second. Note that these times include creating the test dataset, so the actual differencing is almost twice as fast.

A direct comparison with a for-loop-based solution shows the difference in speed:

diff_for_loop = function(vec) {
    # Logical result; the first element has no predecessor
    result_vec = logical(length(vec))
    result_vec[1] = NA
    for(i in seq_along(vec)[-1]) {
      if(vec[i] == vec[i-1]) {
        result_vec[i] <- FALSE
      } else {
        result_vec[i] <- TRUE
      }
    }
    return(result_vec)
}
vec = sample(1:10, 10e5, replace = TRUE)
system.time(dum_for_loop <- diff_for_loop(vec))
#   user  system elapsed 
#  1.220   0.008   1.232 
system.time(dum_diff <- diff(vec) == 0)
#   user  system elapsed 
#  0.051   0.005   0.056 

In this run, that makes the diff-based solution about 22 times faster (1.232 s versus 0.056 s elapsed).
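
Before relying on a faster variant, it is worth checking that it gives the same answer as the loop. The following is a minimal sketch of such a check; changed_loop is a hypothetical helper written for this illustration, and all three versions mark a change with TRUE:

```r
set.seed(1)  # make the random test data reproducible
vec <- sample(1:10, 1000, replace = TRUE)

# Hypothetical loop-based reference implementation
changed_loop <- function(v) {
  out <- logical(length(v))
  out[1] <- NA  # the first element has no predecessor
  for (i in seq_along(v)[-1]) out[i] <- v[i] != v[i - 1]
  out
}

by_loop <- changed_loop(vec)
by_diff <- c(NA, diff(vec) != 0)                 # diff-based version
by_neq  <- c(NA, vec[-1] != vec[-length(vec)])   # shifted comparison

# Stops with an error if the three results are not exactly the same
stopifnot(identical(by_loop, by_diff), identical(by_loop, by_neq))
```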
