
Get runs of consecutive integers of certain length and sample from first values

I am trying to create a function that will return the first integer of a subset of a vector such that the values of the subset are discrete, increasing by 1, and of a specified length.

For example, using the input data 'v' and a specified length 'l' of 3:

v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

The possible sub-vectors of consecutive values of length 3 would be:

c(3, 4, 5)
c(4, 5, 6)
c(25, 26, 27)

Then I want to randomly choose one of these vectors and return its first (lowest) number, i.e. 3, 4, or 25.

Here's an approach with base R:

First, we create all possible sub-vectors of length length (the function's second argument). Next, we keep only those sub-vectors whose successive differences (diff) all equal 1. The is.na test ensures that the trailing sub-vectors, which run past the end of vec and therefore contain NA, are filtered out as well. Then we bind the remaining vectors into a matrix and sample from its first column.

SampleSequencialVectors <- function(vec, length){
  all.vecs <- lapply(seq_along(vec),function(x)vec[x:(x+(length-1))])
  seq.vec <- all.vecs[sapply(all.vecs,function(x) all(diff(x) == 1 & !is.na(diff(x))))]
  sample(do.call(rbind,seq.vec)[,1],1)
}

replicate(10, SampleSequencialVectors(v, 3))
# [1]  3  4  3  3  4  4 25 25  3 25
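
One edge case is worth guarding against: sample(x, 1) treats a single number x >= 1 as the range 1:x, so with exactly one qualifying sub-vector (say one starting at 25) it would draw from 1:25 instead of returning 25. A minimal sketch of the same sliding-window idea that indexes with sample.int() instead; the name sample_run_start and the choice to return NA when nothing qualifies are assumptions, not part of the answer above:

```r
# Hypothetical helper: same sliding-window idea as above, but it indexes with
# sample.int() so that a single candidate is returned unchanged.
sample_run_start <- function(vec, len) {
  n_windows <- length(vec) - len + 1
  if (n_windows < 1) return(NA)          # assumption: NA when vec is too short
  # Only start positions that leave room for a full window
  starts <- seq_len(n_windows)
  ok <- vapply(starts,
               function(i) isTRUE(all(diff(vec[i:(i + len - 1)]) == 1)),
               logical(1))
  firsts <- vec[starts[ok]]
  if (length(firsts) == 0) return(NA)    # assumption: NA when no run qualifies
  firsts[sample.int(length(firsts), 1)]
}

sample_run_start(c(1, 2, 25, 26, 27), 3)  # only one candidate, so always 25
```

Because the windows never run past the end of the vector, no is.na filtering is needed in this variant.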

Or if you prefer a tidyverse type approach:

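library(magrittr)  # provides the %>% pipe used below

SampleSequencialVectorsPurrr <- function(vec, length){
  vec %>%
    seq_along %>%
    purrr::map(~vec[.x:(.x+(length-1))]) %>%                 # all windows of the given length
    purrr::keep(~ all(diff(.x) == 1 & !is.na(diff(.x)))) %>% # keep consecutive, NA-free windows
    do.call(rbind, .) %>%   # do.call() in place of purrr::invoke(), which newer purrr versions deprecate
    {sample(.[,1],size = 1)}
}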
replicate(10, SampleSequencialVectorsPurrr(v, 3))
# [1]  4 25 25  3 25  4  4  3  4 25

  1. Split the vector into runs of consecutive values*: runs = split(v, cumsum(c(1L, diff(v) != 1)))
  2. Select runs of length above or equal to the limit: runs[lengths(runs) >= lim]
  3. From each run, select the possible first values: x[1:(length(x) - lim + 1)]
  4. From all possible first values, sample 1.

     lim = 3  # required run length (l in the question)
     runs = split(v, cumsum(c(1L, diff(v) != 1)))
     first = lapply(runs[lengths(runs) >= lim], function(x) x[1:(length(x) - lim + 1)])
     sample(unlist(first), 1)

Here we loop over runs of sufficient length, and not all individual values (see the other answers), thus it may be faster on larger vectors (haven't tested).
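
For checking the result, or comparing this approach against the sliding-window answers, it can help to list every candidate first value deterministically before sampling. A small sketch of the steps above wrapped in a function; the name run_first_values is hypothetical, and sorting and de-duplicating the input is an added assumption:

```r
# Hypothetical wrapper around the run-splitting steps listed above.
run_first_values <- function(v, lim) {
  v <- sort(unique(v))                            # assumption: order and duplicates do not matter
  runs <- split(v, cumsum(c(1L, diff(v) != 1)))   # step 1: runs of consecutive values
  long <- runs[lengths(runs) >= lim]              # step 2: runs that are long enough
  # step 3: each run of length n contributes its first n - lim + 1 values
  unlist(lapply(long, function(x) x[seq_len(length(x) - lim + 1)]),
         use.names = FALSE)
}

run_first_values(c(3, 4, 5, 6, 15, 16, 25, 26, 27), 3)
# [1]  3  4 25
```

The function returns NULL when no run is long enough, so a caller would need to check for that before sampling.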


Slightly more compact using data.table:

 library(data.table)
 sample(data.table(v)[ , if(.N >= lim) v[1:(length(v) - lim + 1)],
                       by = .(cumsum(c(1L, diff(v) != 1)))]$V1, 1)

*Credit to the nice canonical question: How to split a vector into groups of consecutive sequences?

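Base R in two lines. Please note that this solution assumes v is sorted and, because it only looks one step to either side of each value, it only works for runs of length 3. The result is a one-element list holding the chosen sub-vector; its first element is the starting value.

consec_seq <- sapply(seq_along(v), function(i) split(v, abs(v - v[i]) > 1)[1])
sample(consec_seq[lengths(consec_seq) == l], 1)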

As a reusable function (not assuming sorted v):

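conseq_split_sample <- function(vec, n){
  v <- sort(vec)
  # For each value, collect the values within 1 of it (the "FALSE" group of the split)
  consec_seq <- sapply(seq_along(v), function(i) split(v, abs(v - v[i]) > 1)["FALSE"])
  # sample() on the filtered list picks uniformly among however many candidates exist
  sample(consec_seq[lengths(consec_seq) == n], 1)
}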
conseq_split_sample(v, l)

Data:

 l <- 3
 v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)

Tooting my own horn: cgwtools::seqle is like rle, but you can specify the desired increment in a run. seqle(x, incr = 0, ...) is the same as rle(x).

Then just grab the run lengths and starting values from the result.
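
To illustrate what "grab the run lengths and starting values" leads to, here is a base-R stand-in that computes the same two pieces of information without cgwtools. The variable names run_start and run_len are hypothetical, and the sketch does not call seqle itself, so it makes no claim about that function's exact return value:

```r
# Base-R stand-in for the run summary (cgwtools is not required here).
v <- c(3, 4, 5, 6, 15, 16, 25, 26, 27)
l <- 3

breaks    <- c(TRUE, diff(v) != 1)          # TRUE wherever a new run begins
run_start <- v[breaks]                      # 3 15 25
run_len   <- rle(cumsum(breaks))$lengths    # 4 2 3

# A run of length n starting at s yields the starts s, s + 1, ..., s + n - l
keep   <- run_len >= l
firsts <- unlist(Map(function(s, n) s + 0:(n - l), run_start[keep], run_len[keep]))
firsts                                      # 3 4 25

# Index with sample.int() so a single candidate is returned unchanged
firsts[sample.int(length(firsts), 1)]
```

Working from run starts and lengths avoids materialising every window, since the candidate first values follow from arithmetic on each run.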

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 