简体   繁体   中英

Way to check if each position in a string as part of a list of vectors is variable?

I have the following R list called codes :

>codes
$`1`
 [1] "000" "000" "111" "000" "100" "000" "100" "000" "100" "100"

$`2`
 [1] "000" "001" "110" "000" "000" "000" "000" "000" "000" "000"

$`3`
 [1] "000" "010" "100" "001" "001" "000" "000" "000" "001" "001"

$`4`
 [1] "000" "100" "000" "011" "011" "000" "000" "000" "011" "011"

What I am trying to do is implement a way to check, across the entire list, if each position of the strings in that list element is variable.

For example, for the first vector [1], it will check only the first number/character for each string item and identify if they all start with 0 or 1 (so non variable) or least one is different. Then it will look at the second character in each string and do the same. And so on for the third position.

My intuition is to use some combination of lapply(codes, strsplit, split = "") to get individual characters split apart along with length(unique(x))) != 1) to get it to spit out a TRUE/FALSE if any of the positions are variable or not, but I'm not sure of an efficient way to implement this. Ultimately, the dataset will be much larger with codes up 10 characters in length and I want to ensure that every position for each three letter code in the vector is variable and apply that to each vector element in the list.

I'd appreciate any help or suggestions.

Here is one approach based on string splitting:

l <- list(rep.int(strrep("0", 6L), 8L),
          rep.int(strrep(c("10", "01"), 3L), 4L),
          rep.int(strrep(c("10", "100"), c(3L, 2L)), c(4L, 4L)))
l
## [[1]]
## [1] "000000" "000000" "000000" "000000"
## [5] "000000" "000000" "000000" "000000"
## 
## [[2]]
## [1] "101010" "010101" "101010" "010101"
## [5] "101010" "010101" "101010" "010101"
## 
## [[3]]
## [1] "101010" "101010" "101010" "101010"
## [5] "100100" "100100" "100100" "100100"

f <- function(l) {
    m <- nchar(l[[1L]][1L])
    n <- length(l)
    f0 <- function(x) {
        matrix(unlist(strsplit(x, ""), FALSE, FALSE), m)
    }
    X <- do.call(rbind, lapply(l, f0))
    matrix(matrixStats::rowAnys(X != X[, 1L]), n, byrow = TRUE)
}
f(l)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE  TRUE  TRUE  TRUE FALSE

If your codes can be read as decimal numbers less than or equal to .Machine$integer.max , then you can optimize by replacing string splitting with integer arithmetic:

g <- function(l) {
    m <- length(l)
    n <- length(l[[1L]])
    N <- nchar(l[[1L]][1L])
    X <- matrix(as.integer(unlist(l, FALSE, FALSE)), m, n, byrow = TRUE)
    g0 <- function(pow) {
        Y <- X %/% pow
        X <<- X - pow * Y
        matrixStats::rowAnys(Y != Y[, 1L])
    }
    pow <- as.integer(10^((N - 1L):0))
    matrix(unlist(lapply(pow, g0), FALSE, FALSE), m, N)
}
g(l)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE  TRUE  TRUE  TRUE FALSE

If your codes are actually binary, then you can optimize slightly more and dispense with matrixStats :

h <- function(l) {
    m <- length(l[[1L]])
    n <- length(l)
    N <- nchar(l[[1L]][1L])
    X <- matrix(as.integer(unlist(l, FALSE, FALSE)), m, n)
    h0 <- function(p) {
        Y <- X %/% p
        X <<- X - p * Y
        .colSums(Y, m, n) %% m > 0L
    }
    pow <- as.integer(10^((N - 1L):0))
    matrix(unlist(lapply(pow, h0), FALSE, FALSE), n, N)
}
h(l)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE  TRUE  TRUE  TRUE FALSE

Here is a benchmark on a length-10000 list of length-8 character vectors of 6-digit binary codes.

ll <- rep_len(l, 1e+04L)
microbenchmark::microbenchmark(f(ll), g(ll), h(ll))
## Unit: milliseconds
##   expr       min        lq      mean    median        uq       max neval
##  f(ll) 41.583143 55.960510 66.201555 64.211679 73.542807 127.47810   100
##  g(ll)  8.612173  8.856123  9.725214  8.946077  9.116391  46.66698   100
##  h(ll)  7.622679  7.824789  8.717184  7.887519  7.987128  46.32225   100

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM