Function to test if short numeric vector is a portion of long numeric vector in R
I am trying to test if a short numeric vector is a portion of a longer numeric vector. For example, if a = c(2, 3) and b = c(1, 3, 2, 4, 2, 3, 1), then I'm trying to find / think of a function that would answer the question: is a a part of b? The output should be TRUE.
Alternatively, if c = c(1, 3, 2, 4, 1, 3, 1), then the output of "is a a part of c?" should be FALSE.
match() doesn't do the job:
match(a, b)
returns
3 2
Nor does the %in% operator:
a %in% b
returns
TRUE TRUE
I know there are options for string matching but I'd prefer not to work around this issue by converting to strings...
Here's my crack at it:
valInLong <- function(val, long){
  n.long <- length(long)
  n.val <- length(val)
  # Find where in the longer vector the first
  # element of val is. This is so we can vectorize later
  first <- which(long == val[1])
  # If the first element is too near the end we don't care
  # about it
  first <- first[first <= n.long - n.val + 1]
  # sequence from 0 to n.val - 1 used for grabbing subsequences
  se <- seq_along(val)-1
  # Look at all subsequences starting at 'first' that go
  # the length of val and do an elementwise comparison.
  # If any match in all positions then the subsequence is
  # the sequence of interest.
  any(sapply(first, function(x){all(long[x+se] == val)}))
}
long <- rpois(1000, 5)
a <- c(123421, 232, 23423) # probably not in long
valInLong(a, long)
a <- long[34:100]
valInLong(a, long)
Here's an attempt. I don't think it's super fast, but it's not super slow either:
a = c(2,3)
b1 = c(1, 3, 2, 4, 2, 3, 1)
b2 = c(1, 3, 2, 4, 1, 3, 1)
ainb <- function(a,b) {
  any(apply( embed(b,length(a)), 1, function(x) all(rev(a)==x) ))
}
ainb(a,b1)
#[1] TRUE
ainb(a,b2)
#[1] FALSE
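To see why rev(a) is needed, it may help to look at what embed() returns for the example: each row holds one window of b, but in reverse order. A small sketch using only base R (the names `windows` and `forward` are just for illustration):

```r
b1 <- c(1, 3, 2, 4, 2, 3, 1)
a  <- c(2, 3)

# Each row of embed(b1, 2) is a length-2 window of b1, latest value first:
# row i is c(b1[i + 1], b1[i]).
windows <- embed(b1, length(a))
windows

# Reversing the columns puts every window back in its original order,
# so it can be compared with `a` directly instead of with rev(a).
forward <- windows[, ncol(windows):1, drop = FALSE]
any(apply(forward, 1, function(w) all(w == a)))
```

Either form does the same comparisons; reversing `a` once is simply cheaper than reversing every row.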
You could always brute-force it, if your vectors aren't going to be too long:
f <- function(a, b) {
  if(length(a)==0) return(TRUE)
  ix <- seq_along(b)
  for(i in seq_along(a)) {
    ix <- ix[which(a[i] == b[ix + i - 1])]
  }
  length(ix) > 0
}
f(a, b)
# [1] TRUE
f(a, c)
# [1] FALSE
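The loop keeps a shrinking set of candidate start positions: after step i, ix holds only the positions where the first i elements of a have matched. A sketch of the same filtering with the intermediate state recorded (`trace_f` is a hypothetical helper, not part of the answer above):

```r
trace_f <- function(a, b) {
  ix <- seq_along(b)
  steps <- vector("list", length(a))
  for (i in seq_along(a)) {
    # Positions past the end of b index to NA; which() drops those.
    ix <- ix[which(a[i] == b[ix + i - 1])]
    steps[[i]] <- ix
  }
  steps
}

# Candidate start positions after matching a[1], then a[1:2]
trace_f(c(2, 3), c(1, 3, 2, 4, 2, 3, 1))
```

For the question's example, the first step leaves positions 3 and 5 (where b equals 2), and the second step leaves only position 5, because b[4] is 4 and b[6] is 3.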
Given that OP writes "I'd prefer not to work around this issue by converting to strings...", and the comment by @thelatemail ("converting to strings can be quite slow at times compared to other solutions. But I'll absolutely reserve my judgement depending on what solutions people come up with."), I got a little bit curious to see how a string-based solution performed. Not too badly, it seems.
Here I use base grepl, and the stringi equivalent stri_detect_fixed. They are fastest for the original (short) vectors. @Dason's solution is fastest for medium sized vectors, and the for-loop is fastest for 'long' vectors.
h1 <- function(val, long){
  grepl(pattern = paste0(val, collapse = ","), x = paste0(long, collapse = ","))
}
library(stringi)
h2 <- function(val, long){
  stri_detect_fixed(str = paste0(long, collapse = ","), pattern = paste0(val, collapse = ","))
}
a <- c(2, 3)
b <- c(1, 3, 2, 4, 2, 3, 1)
c <- c(1, 3, 2, 4, 1, 3, 1)
ainb(a, b) # thelatemail
valInLong(a, b) # dason
f(a, b) # pete
h1(a, b)
h2(a, b)
ainb(a, c)
valInLong(a, c)
f(a, c)
h1(a, c)
h2(a, c)
library(microbenchmark)
microbenchmark(ainb(a, b),
valInLong(a, b),
f(a, b),
h1(a, b),
h2(a, b),
times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# ainb(a, b) 201.471 202.611 223.5567 211.7350 223.139 318.932 10 c
# valInLong(a, b) 67.664 76.407 90.2437 89.5215 99.215 129.245 10 b
# f(a, b) 36.873 42.195 54.2833 44.2860 55.879 129.246 10 a
# h1(a, b) 22.809 25.470 32.1595 27.1795 28.510 74.887 10 a
# h2(a, b) 20.147 22.048 31.7794 24.5190 26.609 96.174 10 a
# vectors from @Dason's answer
val <- c(123421, 232, 23423)
long <- rpois(1000, 5)
microbenchmark(ainb(val, long),
valInLong(val, long),
f(val, long),
h1(val, long),
h2(val, long),
times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# ainb(val, long) 24673.332 24872.522 27732.2673 25685.4380 26962.877 45808.000 10 b
# valInLong(val, long) 50.558 55.880 68.5763 66.7135 81.349 91.233 10 a
# f(val, long) 69.945 80.588 89.1036 88.9515 99.215 115.561 10 a
# h1(val, long) 387.737 391.158 432.3644 421.5685 458.062 524.585 10 a
# h2(val, long) 337.559 342.120 378.1190 378.0425 382.035 458.442 10 a
# longer 'val' and 'long' vectors
val <- rpois(100, 5)
long <- rpois(10000, 5)
microbenchmark(ainb(val, long),
valInLong(val, long),
f(val, long),
h1(val, long),
h2(val, long),
times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# ainb(val, long) 298.967481 312.962860 322.350298 322.219875 329.194565 350.080246 10 b
# valInLong(val, long) 5.065280 5.237861 5.533719 5.532845 5.843414 5.921341 10 a
# f(val, long) 1.679050 1.717064 1.763288 1.747284 1.779786 1.907891 10 a
# h1(val, long) 3.648523 3.664869 3.751121 3.707634 3.753820 4.153720 10 a
# h2(val, long) 3.366463 3.444010 3.616591 3.478413 3.758761 4.309955 10 a
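One caveat with pasting the values together: a plain substring search can match across element boundaries, so c(2, 3) would be reported inside c(12, 34) because "12,34" contains "2,3". A possible variant, sketched here under the assumption that wrapping both strings in the separator is acceptable (`h1_delim` is a hypothetical name and was not part of the benchmarks above), also passes fixed = TRUE so that characters such as "." are not treated as regex metacharacters:

```r
h1_delim <- function(val, long, sep = ",") {
  # Leading and trailing separators force a match to start and end
  # on element boundaries.
  wrap <- function(x) paste0(sep, paste0(x, collapse = sep), sep)
  grepl(pattern = wrap(val), x = wrap(long), fixed = TRUE)
}

h1_delim(c(2, 3), c(12, 34))               # boundary-crossing case
h1_delim(c(2, 3), c(1, 3, 2, 4, 2, 3, 1))  # example from the question
```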
This is a variation on the clever answer by @thelatemail, as an infix operator:
`%w/in%` <- function(a, b)
{
  i <- length(a)
  x <- 1:(length(b)-(i-1))
  y <- x + (i-1)
  any(apply(cbind(x, y), 1, function(r) all(a == b[r[1]:r[2]])))
}
It sets up a set of indices to iterate through b, then passes over these to see if any of the selected subsets are all equal. Because it creates these indices before iterating, it may be inefficient in large vectors. Here it is in action.
> a <- c(2, 3)
> b <- c(1, 3, 2, 4, 2, 3, 1)
> c <- c(1, 3, 2, 4, 1, 3, 1)
>
> a %w/in% b
[1] TRUE
> a %w/in% c
[1] FALSE
For what it's worth, this version seems to be significantly faster (after very brief testing):
> a <- c(2, 3, 1)
> b <- sample(1:4, 1000, replace=TRUE)
> a %w/in% b
[1] TRUE
> ainb(a, b)
[1] TRUE
> system.time(replicate(1000, a %w/in% b))
user system elapsed
11.175 0.000 11.187
> system.time(replicate(1000, ainb(a, b)))
user system elapsed
19.930 0.000 19.949
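Note that 1:(length(b)-(i-1)) counts downwards when a is longer than b, so the operator then indexes outside b. A minimal sketch of a guarded variant, assuming an early FALSE is the desired answer in that case (`%within%` is a hypothetical name chosen to avoid clashing with the operator above, and it has not been timed):

```r
`%within%` <- function(a, b) {
  n <- length(a)
  # A longer vector can never be a contiguous part of a shorter one.
  if (n > length(b)) return(FALSE)
  # An empty vector is treated as trivially contained (an assumption here).
  if (n == 0) return(TRUE)
  # seq_len() yields an empty sequence instead of counting down.
  starts <- seq_len(length(b) - n + 1)
  any(vapply(starts, function(s) all(a == b[s:(s + n - 1)]), logical(1)))
}

c(2, 3) %within% c(1, 3, 2, 4, 2, 3, 1)
c(2, 3, 1) %within% c(2, 3)
```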
One way is to exhaustively search the longer vector at all possible indices for a series of matches equal in length to the shorter vector. I doubt this way is efficient for very large problems and suspect that string conversion (and also trying to simplify my own answer!) would be worth investigating, but...
compareTuple <- function(v.lng, v.shrt, idx)
{
  #idx is starting index of v.lng to begin comparison
  len = length(v.shrt)
  prod(v.lng[idx:(idx+len-1)] == v.shrt)
}
containsTuple <- function(v.lng, v.shrt)
{
  as.logical(sum(sapply(
    FUN = function(x){prod(compareTuple(v.lng, v.shrt, x))},
    X = 1:(length(v.lng)-length(v.shrt)+1)
  )))
}
should do the trick. Here are the results:
a = c(2, 3); b = c(1, 3, 2, 4, 2, 3, 1); c = c(1, 3, 2, 4, 1, 3, 1)
> containsTuple(c,a)
[1] FALSE
> containsTuple(b,a)
[1] TRUE