简体   繁体   中英

selecting only the 0s and the first 1 from a sequence of many 0s and few 1s in R?

I have a sequence of 0s and 1s in this manner:

xx <- c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 
                    0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1)

And I want to select the 0s and the first 1s.

The results should be:

ans <- c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1)

What's the fastest way? in R

Use rle() to extract the run lengths and values, do some minor surgery, and then put the run-length encoded vector "back together" using inverse.rle() .

rr <- rle(xx)
rr$lengths[rr$values==1] <- 1
inverse.rle(rr)
#  [1] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1

Here's one way:

idx <- which(xx == 1)
pos <- which(diff(c(xx[1], idx)) == 1)
xx[-idx[pos]] # following Frank's suggestion
# [1] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1

Without rle:

xx[head(c(TRUE, (xx != 1)), -1) | (xx != 1)]
#[1] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1

Since OP mentioned speed, here's a benchmark:

josh = function(xx) {
  rr <- rle(xx)
  rr$lengths[rr$values==1] <- 1
  inverse.rle(rr)
}

arun = function(xx) {
  idx <- which(xx == 1)
  pos <- which(diff(c(xx[1], idx)) == 1)
  xx[setdiff(seq_along(xx), idx[pos])]
}

eddi = function(xx) {
  xx[head(c(TRUE, (xx != 1)), -1) | (xx != 1)]
}

simon = function(xx) {
    #  The body of the function is supplied in @SimonO101's answer
    first1(xx)
}

set.seed(1)
N = 1e6    
xx = sample(c(0,1), N, T)

library(microbenchmark)
bm <- microbenchmark(josh(xx), arun(xx), eddi(xx), simon(xx) , times = 25)
print( bm , digits = 2 , order = "median" )
#Unit: milliseconds
#      expr min  lq median  uq max neval
# simon(xx)  20  21     23  26  72    25
#  eddi(xx)  97 102    104 118 149    25
#  arun(xx) 205 245    253 258 332    25
#  josh(xx) 228 268    275 287 365    25

Here's a quick Rcpp solution. Should be fastish (but I've no idea how it will stack up against the others here)...

Rcpp::cppFunction( 'std::vector<int> first1( IntegerVector x ){
    std::vector<int> out;
    for( IntegerVector::iterator it = x.begin(); it != x.end(); ++it ){
        if( *it == 1 && *(it-1) != 1 || *it == 0  )
          out.push_back(*it);
    }
    return out;
}')

first1(xx)
# [1] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1

Even tho' I'm a staunch supporter of rle , since it's Friday here's an alternative method. I did it for fun, so YMMV.

yy<-paste(xx,collapse='')
zz<-gsub('[1]{1,}','1',yy)  #I probably screwed up the regex here
aa<- as.numeric(strsplit(zz,'')[[1]])

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM