简体   繁体   中英

Is there an efficient way to backward search a large vector in R?

I have a vector of 10+ million elements. I need to find all elements satisfying a given condition A (eg X < 2 at rows i %in% c(6,10) ). From each of these elements I need to skim the vector backwards and flag all preceding elements while they satisfy condition B (eg X < 4 for i %in% c(8:10) and c(5:6) ).

For example, given the following X column, I would like the final result to be the flag2 column. I am not interested in elements where B is true if they are not immediately preceding an element satisfying A, therefore row i == 2 has flag2 == 0 .

  i  |  X  | flag1 | flag2
---------------------------
  1  |  4  |   0   |   0
  2  |  3  |   0   |   0
  3  |  6  |   0   |   0
  4  |  9  |   0   |   0
  5  |  3  |   0   |   1
  6  |  1  |   1   |   1
  7  |  9  |   0   |   0
  8  |  3  |   0   |   1
  9  |  2  |   0   |   1
 10  |  1  |   1   |   1

The first operation to produce flag1 is simple and very fast:

# locate all occurrences of X < 2
my_data$flag1 = dplyr::case_when(my_data$X < 2 ~ 1, T ~ 0)

I have implemented the second operation with the following for loop, which gives the desired result but is prohibitively time-consuming given the amount of data.

# flag all elements preceding the ones already flagged while they satisfy `X < 4`
my_data$flag2 = my_data$flag1
for(i in nrow(my_data):2){
  if((my_data[i,]$flag2 == 1) & (my_data[i-1,]$X < 4)){ 
    my_data[i-1,]$flag2 = 1
  }
}

Is there any way I could do this more efficiently?

Hope the following can seed it up. It is subsetting and shifting the index of flag by one position like and repeating it until it does not flag anymore:

my_data  <- data.frame(X=c(4,3,6,9,3,1,9,3,2,1))

my_data$flag1 <- my_data$X < 2
my_data$flag2 <-  my_data$flag1
repeat {
  tt <- my_data$X < 4 & c(my_data$flag2[-1], FALSE)
  if(all(!(tt & !my_data$flag2))) break
  my_data$flag2[tt]  <- TRUE
}
my_data
   X flag1 flag2
1  4 FALSE FALSE
2  3 FALSE FALSE
3  6 FALSE FALSE
4  9 FALSE FALSE
5  3 FALSE  TRUE
6  1  TRUE  TRUE
7  9 FALSE FALSE
8  3 FALSE  TRUE
9  2 FALSE  TRUE
10 1  TRUE  TRUE

or using Reduce :

my_data  <- data.frame(X=c(4,3,6,9,3,1,9,3,2,1))

my_data$flag1 <- my_data$X < 2
my_data  <- my_data[nrow(my_data):1,]
fun <- function(x, y) {c(y[[1]] || (x[[1]] && y[[2]]), FALSE)}
my_data$flag2  <- do.call(rbind, Reduce(fun
  , as.data.frame(rbind(my_data$flag1, my_data$X < 4))[,-1]
  , c(my_data$flag1[1], FALSE), accumulate = TRUE))[,1]
my_data  <- my_data[nrow(my_data):1,]
my_data
#   X flag1 flag2
#1  4 FALSE FALSE
#2  3 FALSE FALSE
#3  6 FALSE FALSE
#4  9 FALSE FALSE
#5  3 FALSE  TRUE
#6  1  TRUE  TRUE
#7  9 FALSE FALSE
#8  3 FALSE  TRUE
#9  2 FALSE  TRUE
#10 1  TRUE  TRUE

here is another possibility using the accumulate function from the purrr package:

library(tidyverse)

my_data  <- data.frame(X=c(4,3,6,9,3,1,9,3,2,1))

my_fun <- function(flag1, xlag) if ((flag1 == 1 & xlag < 4) | xlag < 2) 1 else 0

my_data %>%
  mutate(flag1 = if_else(X < 2, 1, 0),
         flag2 = rev(accumulate(rev(X), my_fun, .init = last(flag1))[-1]))

   X flag1 flag2
1  4     0     0
2  3     0     0
3  6     0     0
4  9     0     0
5  3     0     1
6  1     1     1
7  9     0     0
8  3     0     1
9  2     0     1
10 1     1     1

If you are ok with using the data.table package, then it takes less than 1s for 10million rows using:

library(data.table)
nr <- 10e6
set.seed(0L)
my_data <- data.frame(X=sample(1:9, nr, TRUE))

system.time({
    setDT(my_data)[, flag2 := { 
        flag1 <- X < 2
        b <- rleid(X < 4)
        +(b %in% b[flag1])
    }]
})

#   user  system elapsed 
#   0.30    0.12    0.42 

output:

          X flag2
       1: 9     0
       2: 4     0
       3: 7     0
       4: 1     1
       5: 2     1
      ---        
 9999996: 6     0
 9999997: 1     1
 9999998: 9     0
 9999999: 6     0
10000000: 1     1

head(my_data, 10) :

    X flag2
 1: 9     0
 2: 4     0
 3: 7     0
 4: 1     1
 5: 2     1
 6: 7     0
 7: 2     1
 8: 3     1
 9: 1     1
10: 5     0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM