
Function to test if short numeric vector is a portion of long numeric vector in R

I am trying to test if a short numeric vector is a portion of a longer numeric vector. For example, if a = c(2, 3) and b = c(1, 3, 2, 4, 2, 3, 1), then I'm looking for a function that answers the question: is a a part of b, i.e. does a appear in b as a contiguous run in the same order? The output should be TRUE.

Alternatively, if c = c(1, 3, 2, 4, 1, 3, 1), then the output of "is a a part of c?" should be FALSE.

match() doesn't do the job:

match(a, b)

returns

3  2

Nor does the %in% operator; a %in% b returns:

TRUE  TRUE

I know there are options for string matching, but I'd prefer not to work around this issue by converting to strings...
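To make the shortfall concrete, here is a small sketch using the second example. The name b2 is only an illustrative stand-in for c above (renamed so it does not share a name with base c()). Both match() and %in% check each element of a on its own, so neither can tell whether the elements sit next to each other:

```r
a  <- c(2, 3)
b2 <- c(1, 3, 2, 4, 1, 3, 1)  # same values as c in the question

match(a, b2)    # position of the first occurrence of each element of a
a %in% b2       # whether each element of a occurs anywhere in b2
all(a %in% b2)  # TRUE, even though 2 is never immediately followed by 3
```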

Here's my crack at it:

valInLong <- function(val, long){
  n.long <- length(long)
  n.val <- length(val)
  # Find where in the longer vector the first
  # element of val is.  This is so we can vectorize later
  first <- which(long == val[1])
  # If the first element is too near the end we don't care
  # about it
  first <- first[first <= n.long - n.val + 1]
  # sequence from 0 to n.val - 1 used for grabbing subsequences
  se <- seq_along(val)-1
  # Look at all subsequences starting at 'first' that go
  # the length of val and do an elementwise comparison.
  # If any match in all positions then the subsequence is
  # the sequence of interest.
  any(sapply(first, function(x){all(long[x+se] == val)}))
}


long <- rpois(1000, 5)
a <- c(123421, 232, 23423) # probably not in long

valInLong(a, long)
a <- long[34:100]
valInLong(a, long)

Here's an attempt. I don't think it's super fast, but it's not super slow either:

a  = c(2,3)
b1 = c(1, 3, 2, 4, 2, 3, 1)
b2 = c(1, 3, 2, 4, 1, 3, 1)

ainb <- function(a,b) {
  any(apply( embed(b,length(a)), 1, function(x) all(rev(a)==x) ))
}
ainb(a,b1)
#[1] TRUE
ainb(a,b2)
#[1] FALSE
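For anyone wondering why rev(a) is needed: embed() lays out every window of the given length as a row of a matrix, but with the elements of each window in reverse order. A quick look at the intermediate matrix (the names windows and hit are just for illustration):

```r
a  <- c(2, 3)
b1 <- c(1, 3, 2, 4, 2, 3, 1)

windows <- embed(b1, length(a))
windows
#      [,1] [,2]
# [1,]    3    1
# [2,]    2    3
# [3,]    4    2
# [4,]    2    4
# [5,]    3    2
# [6,]    1    3

# Row i holds b1[i + 1], b1[i], so each row is compared against rev(a)
hit <- which(apply(windows, 1, function(x) all(x == rev(a))))
hit
# [1] 5
```

Row 5 is c(3, 2), i.e. the window b1[5:6] = c(2, 3) read backwards, which is why ainb(a, b1) is TRUE.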

You could always brute-force it, if your vectors aren't going to be too long:

f <- function(a, b) {

    if(length(a)==0) return(TRUE)

    ix <- seq_along(b)

    for(i in seq_along(a)) {

        ix <- ix[which(a[i] == b[ix + i - 1])]
    }

    length(ix) > 0
}

f(a, b)
# [1] TRUE
f(a, c)
# [1] FALSE
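To see what the loop is doing, here is a sketch of the same filtering step that returns the surviving start positions instead of a logical. match_starts and b2 are hypothetical names for illustration (b2 holds the same values as c), not part of the answer above:

```r
# Hypothetical helper: same candidate filtering as f(), but returns the
# start positions that survive every round.
match_starts <- function(a, b) {
  ix <- seq_along(b)  # every position starts out as a candidate
  for (i in seq_along(a)) {
    # keep only candidates whose i-th element agrees with a[i];
    # indices past the end of b give NA, which which() drops
    ix <- ix[which(a[i] == b[ix + i - 1])]
  }
  ix
}

a  <- c(2, 3)
b  <- c(1, 3, 2, 4, 2, 3, 1)
b2 <- c(1, 3, 2, 4, 1, 3, 1)

match_starts(a, b)   # 5: after round 1 the candidates are 3 and 5, round 2 keeps 5
match_starts(a, b2)  # integer(0): the only candidate, 3, is followed by 4
```

The candidate set usually shrinks quickly, which may be part of why this version holds up on longer vectors.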

Given that the OP writes "I'd prefer not to work around this issue by converting to strings...", and the comment by @thelatemail ("converting to strings can be quite slow at times compared to other solutions. But I'll absolutely reserve my judgement depending on what solutions people come up with."), I got a little curious to see how a string-based solution would perform. Not too badly, it seems.

Here I use base grepl and the stringi equivalent, stri_detect_fixed. They are fastest for the original (short) vectors. @Dason's solution is fastest for medium-sized vectors, and the for-loop is fastest for 'long' vectors.

h1 <- function(val, long){
  grepl(pattern = paste0(val, collapse = ","), x = paste0(long, collapse = ","))
}

library(stringi)
h2 <- function(val, long){
  stri_detect_fixed(str = paste0(long, collapse = ","), pattern = paste0(val, collapse = ","))
}


a <- c(2, 3)
b <- c(1, 3, 2, 4, 2, 3, 1)
c <- c(1, 3, 2, 4, 1, 3, 1)

ainb(a, b) # thelatemail
valInLong(a, b) # dason
f(a, b) # pete
h1(a, b)
h2(a, b)

ainb(a, c)
valInLong(a, c)
f(a, c)
h1(a, c)
h2(a, c)

library(microbenchmark)
microbenchmark(ainb(a, b),
               valInLong(a, b),
               f(a, b),
               h1(a, b),
               h2(a, b),
               times = 10)
# Unit: microseconds
#            expr     min      lq     mean   median      uq     max neval cld
#      ainb(a, b) 201.471 202.611 223.5567 211.7350 223.139 318.932    10   c
# valInLong(a, b)  67.664  76.407  90.2437  89.5215  99.215 129.245    10  b 
#         f(a, b)  36.873  42.195  54.2833  44.2860  55.879 129.246    10 a  
#        h1(a, b)  22.809  25.470  32.1595  27.1795  28.510  74.887    10 a  
#        h2(a, b)  20.147  22.048  31.7794  24.5190  26.609  96.174    10 a 


# vectors from @Dason's answer
val <- c(123421, 232, 23423)
long <- rpois(1000, 5)
microbenchmark(ainb(val, long),
               valInLong(val, long),
               f(val, long),
               h1(val, long),
               h2(val, long),
               times = 10)
# Unit: microseconds
#                 expr       min        lq       mean     median        uq       max neval cld
#      ainb(val, long) 24673.332 24872.522 27732.2673 25685.4380 26962.877 45808.000    10   b
# valInLong(val, long)    50.558    55.880    68.5763    66.7135    81.349    91.233    10  a 
#         f(val, long)    69.945    80.588    89.1036    88.9515    99.215   115.561    10  a 
#        h1(val, long)   387.737   391.158   432.3644   421.5685   458.062   524.585    10  a 
#        h2(val, long)   337.559   342.120   378.1190   378.0425   382.035   458.442    10  a


# longer 'val' and 'long' vectors
val <- rpois(100, 5)
long <- rpois(10000, 5)
microbenchmark(ainb(val, long),
               valInLong(val, long),
               f(val, long),
               h1(val, long),
               h2(val, long),
               times = 10)
# Unit: milliseconds
#                 expr        min         lq       mean     median         uq        max neval cld
#      ainb(val, long) 298.967481 312.962860 322.350298 322.219875 329.194565 350.080246    10   b
# valInLong(val, long)   5.065280   5.237861   5.533719   5.532845   5.843414   5.921341    10  a 
#         f(val, long)   1.679050   1.717064   1.763288   1.747284   1.779786   1.907891    10  a 
#        h1(val, long)   3.648523   3.664869   3.751121   3.707634   3.753820   4.153720    10  a 
#        h2(val, long)   3.366463   3.444010   3.616591   3.478413   3.758761   4.309955    10  a
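One caveat with the string-based approach: matching on the bare comma-joined string can report a hit that straddles element boundaries, e.g. "2,3" is found inside "12,34". A possible guard, sketched here under the assumption that the values print without commas, is to pad both strings with the separator and match literally. h1_padded is a hypothetical variant and was not part of the timings above:

```r
# h1 as defined above, repeated so the example is self-contained
h1 <- function(val, long){
  grepl(pattern = paste0(val, collapse = ","), x = paste0(long, collapse = ","))
}

# Hypothetical variant: wrap both strings in separators so a match has to
# start and end on element boundaries; fixed = TRUE avoids regex
# interpretation of characters such as "."
h1_padded <- function(val, long){
  pad <- function(x) paste0(",", paste0(x, collapse = ","), ",")
  grepl(pattern = pad(val), x = pad(long), fixed = TRUE)
}

h1(c(2, 3), c(12, 34))                        # TRUE: a false positive
h1_padded(c(2, 3), c(12, 34))                 # FALSE
h1_padded(c(2, 3), c(1, 3, 2, 4, 2, 3, 1))    # TRUE
```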

This is a variation on the clever answer by @thelatemail, as an infix operator:

`%w/in%` <- function(a, b)
{
    i <- length(a)
    x <- 1:(length(b)-(i-1))
    y <- x + (i-1)

    any(apply(cbind(x, y), 1, function(r) all(a == b[r[1]:r[2]])))
}

It sets up a set of indices to iterate through b, then passes over these to see whether any of the selected subsets is equal to a in every position. Because it creates these indices before iterating, it may be inefficient for large vectors. Here it is in action:

> a <- c(2, 3)
> b <- c(1, 3, 2, 4, 2, 3, 1)
> c <- c(1, 3, 2, 4, 1, 3, 1)
> 
> a %w/in% b
[1] TRUE
> a %w/in% c
[1] FALSE

For what it's worth, this version seems to be significantly faster (after very brief testing):

> a <- c(2, 3, 1)
> b <- sample(1:4, 1000, replace=TRUE)
> a %w/in% b
[1] TRUE
> ainb(a, b)
[1] TRUE
> system.time(replicate(1000, a %w/in% b))
   user  system elapsed 
 11.175   0.000  11.187 
> system.time(replicate(1000, ainb(a, b)))
   user  system elapsed 
 19.930   0.000  19.949 
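If building all the index pairs up front turns out to be a concern, one alternative is a plain loop that stops at the first matching window. This is only a sketch; contains_run is a hypothetical name, it has not been timed against the versions above, and it assumes neither vector contains NA:

```r
# Hypothetical early-exit variant: slide a window along b and return as
# soon as one window equals a in every position.
contains_run <- function(a, b) {
  n <- length(a)
  if (n == 0) return(TRUE)          # an empty vector is trivially contained
  if (n > length(b)) return(FALSE)  # a cannot fit inside b
  for (start in seq_len(length(b) - n + 1)) {
    if (isTRUE(all(b[start:(start + n - 1)] == a))) return(TRUE)
  }
  FALSE
}

contains_run(c(2, 3), c(1, 3, 2, 4, 2, 3, 1))  # TRUE
contains_run(c(2, 3), c(1, 3, 2, 4, 1, 3, 1))  # FALSE
```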

One way is to exhaustively search the longer vector at all possible indices for a series of matches equal in length to the shorter vector. I doubt this way is efficient for very large problems, and I suspect that string conversion (and also trying to simplify my own answer!) would be worth investigating, but...

compareTuple <- function(v.lng, v.shrt, idx)
    {
    #idx is starting index of v.lng to begin comparison
    len = length(v.shrt)
    prod(v.lng[idx:(idx+len-1)] == v.shrt)
    }

containsTuple <- function(v.lng, v.shrt)
    {
    as.logical(sum(sapply(
                        FUN = function(x){prod(compareTuple(v.lng, v.shrt, x))}, 
                        X = 1:(length(v.lng)-length(v.shrt)+1)
                         )))
    }

should do the trick. Here are the results:

a = c(2, 3); b = c(1, 3, 2, 4, 2, 3, 1); c = c(1, 3, 2, 4, 1, 3, 1)

> containsTuple(c,a)
[1] FALSE
> containsTuple(b,a)
[1] TRUE
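On the simplification point: prod() over a logical vector acts as "all are TRUE" and sum() followed by as.logical() acts as "at least one is TRUE", so the same search can probably be written with all() and any() directly. A sketch, where contains_tuple_simple and b2 are hypothetical names (b2 holds the same values as c) and NA-free input is assumed:

```r
# Hypothetical simplification of containsTuple(): all() replaces prod(),
# any() replaces as.logical(sum(...)).
contains_tuple_simple <- function(v.lng, v.shrt) {
  len <- length(v.shrt)
  # seq_len() yields no start positions when v.shrt is longer than v.lng
  starts <- seq_len(max(0, length(v.lng) - len + 1))
  any(vapply(starts,
             function(idx) all(v.lng[idx:(idx + len - 1)] == v.shrt),
             logical(1)))
}

a  <- c(2, 3)
b  <- c(1, 3, 2, 4, 2, 3, 1)
b2 <- c(1, 3, 2, 4, 1, 3, 1)

contains_tuple_simple(b, a)   # TRUE
contains_tuple_simple(b2, a)  # FALSE
```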
