Faster alternative to `range(which(..))`

Let v be a sequence of TRUE and FALSE values in R:

v = c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)

I would like to get the positions of the first and the last TRUE. One way to achieve this is

range(which(v)) # 7 25

but this solution is relatively slow, as it must check every element of the vector to get the position of each TRUE and then loop over all positions, evaluating two if statements at each position (I think) in order to get the maximum and the minimum values. It would be much more strategic to search for the first TRUE starting one from the beginning and one from the end and just return those positions.

Is there a faster alternative to range(which(..))?
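To make the cost described above concrete, here is a small sketch of what range(which(v)) computes, split into its two steps. The name idx is only an illustrative variable, not part of the original question.

```r
v <- c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)

# Step 1: which() looks at every element of v and returns the index of
# each TRUE, in increasing order.
idx <- which(v)
idx
# [1]  7  9 10 12 13 14 15 16 18 20 21 25

# Step 2: range() then goes over idx to find its minimum and maximum.
range(idx)
# [1]  7 25

# Because idx is already sorted, its first and last elements are enough.
idx[c(1L, length(idx))]
# [1]  7 25
```

Either way, the full pass over v in step 1 is still needed, which is the part a search from both ends could avoid.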

The simplest approach I can think of that doesn't involve searching the entire vector would be an Rcpp solution:

library(Rcpp)
cppFunction(
"NumericVector rangeWhich(LogicalVector x) {
  NumericVector ret(2, NumericVector::get_na());
  int n = x.size();
  for (int idx=0; idx < n; ++idx) {
    if (x[idx]) {
      ret[0] = idx+1;  // 1-indexed for R
      break;
    }
  }
  if (R_IsNA(ret[0]))  return ret;  // No true values
  for (int idx=n-1; idx >= 0; --idx) {
    if (x[idx]) {
      ret[1] = idx + 1;  // 1-indexed for R
      break;
    }
  }
  return ret;
}")
rangeWhich(v)
# [1]  7 25

We can benchmark on a fairly long vector (length 1 million) with random entries. We would expect to get pretty large efficiency gains from not searching through the whole thing with which:

set.seed(144)
bigv <- sample(c(F, T), 1000000, replace=T)
library(microbenchmark)
# range_find from @PierreLafortune
range_find <- function(v) {
i <- 1
while(!v[i]) {
  i <- i +1
}
j <- length(v)
while(!v[j]) {
  j <- j-1
}
c(i,j)
}
# shortCircuit from @JoshuaUlrich
shortCircuit <- compiler::cmpfun({
  function(x) {
    first <- 1
    while(TRUE) if(x[first]) break else first <- first+1
    last <- length(x)
    while(TRUE) if(x[last]) break else last <- last-1
    c(first, last)
  }
})
microbenchmark(rangeWhich(bigv), range_find(bigv), shortCircuit(bigv), range(which(bigv)))
# Unit: microseconds
#                expr      min        lq        mean     median         uq       max neval
#    rangeWhich(bigv)    1.476    2.4655     9.45051     9.0640    13.7585    46.286   100
#    range_find(bigv)    1.445    2.2930     8.06993     7.2055    11.8980    26.893   100
#  shortCircuit(bigv)    1.114    1.6920     7.30925     7.0440    10.2210    30.758   100
#  range(which(bigv)) 6821.180 9389.1465 13991.84613 10007.9045 16698.2230 58112.490   100

The Rcpp solution is a good deal faster (more than 500x faster) than range(which(bigv)) because it doesn't need to iterate through the whole vector with which. For this example it has a near-identical runtime to (in fact, slightly slower than) range_find from @PierreLafortune and shortCircuit from @JoshuaUlrich.

Using Joshua's excellent example of some worst-case behavior where the only true values are in the very middle of the vector (I'm repeating his experiment with all proposed functions so we can see the whole picture), we see a very different situation:

bigv2 <- rep(FALSE, 1e6)
bigv2[5e5-1] <- TRUE
bigv2[5e5+1] <- TRUE
microbenchmark(rangeWhich(bigv2), range_find(bigv2), shortCircuit(bigv2), range(which(bigv2)))
# Unit: microseconds
#                 expr        min          lq        mean      median         uq        max neval
#    rangeWhich(bigv2)    546.206    555.3820    593.1385    575.3790    599.055    979.924   100
#    range_find(bigv2) 400057.083 406449.0075 434515.1142 411881.4145 427487.041 697529.163   100
#  shortCircuit(bigv2)  74942.612  75663.7835  79095.3795  76761.5325  79703.265 125054.360   100
#  range(which(bigv2))    632.086    679.0955    761.9610    700.1365    746.509   3924.941   100

For this vector the looping base R solutions are much slower than the original solution (100-600x slower) and the Rcpp solution is barely faster than range(which(bigv2)) (which makes sense, because they're both looping through the whole vector once).
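As a rough illustration of why the two benchmarks differ so much, the sketch below counts how many elements a scan from both ends has to look at before it finds the first and the last TRUE. The helper elements_inspected is hypothetical (it is not one of the benchmarked functions) and assumes the vector contains at least one TRUE.

```r
# Hypothetical helper: number of elements a two-ended scan must inspect.
# It looks at positions 1..first from the front and last..n from the back.
# Assumes v contains at least one TRUE.
elements_inspected <- function(v) {
  idx <- range(which(v))
  idx[1] + (length(v) - idx[2] + 1L)
}

# Small example vector from the question: first TRUE at 7, last at 25.
v <- c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)
elements_inspected(v)
# [1] 13

# Worst case: the only TRUE values sit next to the middle of the vector.
bigv2 <- rep(FALSE, 1e6)
bigv2[5e5-1] <- TRUE
bigv2[5e5+1] <- TRUE
elements_inspected(bigv2)
# [1] 999999
```

In the worst case the two-ended scan touches nearly every element, so its only remaining advantage over which is the cost per element, which is where the interpreted R loops lose out.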

As usual, this needs to come with a disclaimer -- you need to compile your Rcpp function, which also takes time, so this will only be a benefit if you have very large vectors or are repeating this operation many times. From the comments on your question it sounds like you indeed have a large number of large vectors, so this could be a good option for you.

match is quick as it stops when it finds the value searched for:

c(match(T,v),length(v)-match(T,rev(v))+1)
# [1]  7 25

But you would have to test the speeds, since rev(v) has to build a reversed copy of the whole vector first.
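If the match idea is wrapped in a function, the case with no TRUE at all needs a thought, because match returns NA when nothing is found. A minimal sketch, where the name match_range and the choice to return c(NA, NA) for that case are assumptions made for illustration:

```r
# Hypothetical wrapper around the match() one-liner above.
# Returns c(NA, NA) when v contains no TRUE (an assumed convention).
match_range <- function(v) {
  first <- match(TRUE, v)              # index of the first TRUE, or NA
  if (is.na(first)) return(c(NA_integer_, NA_integer_))
  # Position of the first TRUE in the reversed vector, mapped back
  # to an index in the original vector.
  last <- length(v) - match(TRUE, rev(v)) + 1L
  c(first, last)
}

v <- c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)
match_range(v)
# [1]  7 25
match_range(rep(FALSE, 5))
# [1] NA NA
```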

Update:

range_find <- function(v) {
i <- 1
j <- length(v)
while(!v[i]) {
  i <- i+1
}
while(!v[j]) {
  j <- j-1
}
c(i,j)
}
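As written, range_find assumes the vector contains at least one TRUE. On an all-FALSE vector the index runs past the end, v[i] becomes NA, and the while condition raises an error. A guarded variant could look like the sketch below. The name range_find_safe, the c(NA, NA) return value, and treating NA elements as "not TRUE" are all assumptions, and the extra checks will cost some speed compared with the version being benchmarked.

```r
# Hypothetical guarded variant of range_find (not the benchmarked function).
range_find_safe <- function(v) {
  n <- length(v)
  i <- 1L
  # Scan from the front; stop at the first TRUE or when running off the end.
  while (i <= n && !isTRUE(v[i])) {
    i <- i + 1L
  }
  if (i > n) return(c(NA_integer_, NA_integer_))  # no TRUE found
  # Scan from the back; guaranteed to stop at or before position i.
  j <- n
  while (!isTRUE(v[j])) {
    j <- j - 1L
  }
  c(i, j)
}

v <- c(F,F,F,F,F,F,T,F,T,T,F,T,T,T,T,T,F,T,F,T,T,F,F,F,T,F,F,F,F,F)
range_find_safe(v)
# [1]  7 25
range_find_safe(rep(FALSE, 5))
# [1] NA NA
```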

Benchmark

v <- rep(v, 5e4)
microbenchmark(
  rangeWhich = rangeWhich(v),
  range_find = range_find(v),
  richwhich = {w <- which(v)
               w[c(1L, length(w))]},
  match = c(match(T,v),length(v)-match(T,rev(v))+1)
)
Unit: microseconds
       expr       min         lq        mean    median         uq        max neval
 rangeWhich     1.284     3.2090    16.50914    20.211    26.7875     29.836   100
 range_find     9.945    21.4945    32.02652    26.948    34.1660    144.042   100
  richwhich  2941.756  3022.5975  3243.02081  3130.227  3247.6405   5403.911   100
      match 45696.329 46771.8175 50662.45708 47359.526 48718.6055 131439.661   100

This approach matches your proposed strategy:

"It would be much more strategic to search for the first TRUE starting one from the beginning and one from the end and just return those positions."

Just for fun. The simplest approach I can think of that doesn't involve searching the entire vector or Rcpp :P

shortCircuit <- compiler::cmpfun({
  function(x) {
    first <- 1
    while(TRUE) if(x[first]) break else first <- first+1
    last <- length(x)
    while(TRUE) if(x[last]) break else last <- last-1
    c(first, last)
  }
})
set.seed(144)
bigv <- sample(c(F, T), 1000000, replace=T)
library(microbenchmark)
microbenchmark(rangeWhich(bigv), shortCircuit(bigv))
# Unit: microseconds
#                expr   min     lq median     uq   max neval
#    rangeWhich(bigv) 1.722 1.8875 1.9995 2.1400 6.850   100
#  shortCircuit(bigv) 1.053 1.1905 1.3245 1.4545 9.207   100

Woohoo, I win! Oh, wait... let's compare the two on the worst possible case.

v <- rep(FALSE, 1e6)
v[5e5-1] <- TRUE
v[5e5+1] <- TRUE
library(microbenchmark)
microbenchmark(rangeWhich(v), shortCircuit(v))
# Unit: microseconds
#             expr       min         lq    median        uq       max neval
#    rangeWhich(v)   751.252   884.8805  1109.527  1115.995  1163.135   100
#  shortCircuit(v) 60712.586 61004.2760 61396.715 61994.517 72382.216   100

Oh no... I lost, badly. Oh well, at least I had fun. :)
