Efficient way to calculate number of strings which contain a defined shorter string
I have a character vector containing short strings:
short <- c("aaa", "bah", "dju", "kjs")
I want to count the number of strings in the following vector in which at least one of the above short strings is present.
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs", "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")
The number it should output for this is 4, as 4 of the strings in the long vector contain at least one of the shorter strings defined in the short vector.
My actual short vector is around 10000 strings and long is around 1000, so I am looking for an efficient way to calculate this.
Thanks!
This takes about 0.12 seconds on my laptop, where long and short are from the Note at the end and have lengths 10000 and 1000 respectively. No packages are used except to generate the sample data.
system.time(num <- length(grep(paste(short, collapse = "|"), long, perl = TRUE)))
user system elapsed
0.08 0.00 0.12
In comparison, the Reduce/str_count solution takes 6.5 seconds.
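As a quick sanity check, the same one-liner can be run on the small vectors from the question. This is only a sketch on the toy data, and it assumes the short strings contain no regex metacharacters, since they are pasted straight into the pattern.

```r
# Toy data copied from the question
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# Build a single alternation pattern: "aaa|bah|dju|kjs"
pattern <- paste(short, collapse = "|")

# grep() returns the indices of the elements of long that match the pattern,
# so the length of that index vector is the number of matching strings
num <- length(grep(pattern, long, perl = TRUE))
num
#[1] 4
```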
Note: We take the first 1000 and 10000 words from the book Ulysses as the sample data.
library(gsubfn)
u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ")
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)
We loop through the short vector, apply str_count to each element against long, and Reduce the results with | into a single logical vector, whose sum is the number of matching strings.
library(stringr)
sum(Reduce(`|`, lapply(short, str_count, string = long)))
#[1] 4
str_count uses the stringi functions, and this doesn't depend on the length of the vector.
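For comparison, the same loop-and-Reduce idea can be sketched in base R with grepl and fixed = TRUE, so that each short string is treated as a literal substring rather than a regular expression. This is an alternative sketch, not the stringr method above, and it has not been timed on the larger data.

```r
# Toy data copied from the question
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# For each short string, get a logical vector marking which long strings contain it.
# Each element of short is passed as grepl()'s pattern argument because x is named.
hits <- lapply(short, grepl, x = long, fixed = TRUE)

# Combine the logical vectors element-wise with OR, then count the TRUE values
any_hit <- Reduce(`|`, hits)
sum(any_hit)
#[1] 4
```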
This takes me only 0.09s with the data provided above.
system.time(sum(sapply(regmatches(long, gregexpr(paste(short, collapse = "|"), long, ignore.case = FALSE, perl = TRUE)), length) >= 1))
user system elapsed
0.09 0.00 0.09
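One caveat for any pattern built with paste(short, collapse = "|"): if the short strings can contain regex metacharacters such as . or +, they will be interpreted as regex syntax. A minimal sketch of one way around this, assuming perl = TRUE and that no short string contains the sequence \E, is to wrap each string in PCRE's \Q...\E literal quoting. The vectors short_meta and long_meta are hypothetical and only serve to illustrate the difference.

```r
# Hypothetical data: short strings that contain regex metacharacters
short_meta <- c("a.b", "x+y")
long_meta <- c("a.b", "axb", "x+y", "xxy", "zzz")

# Unquoted alternation: "." and "+" act as regex operators,
# so "axb" and "xxy" are also counted
plain <- paste(short_meta, collapse = "|")
n_plain <- sum(grepl(plain, long_meta, perl = TRUE))

# Quoted alternation: \Q...\E makes PCRE treat each short string literally
quoted <- paste0("\\Q", short_meta, "\\E", collapse = "|")
n_quoted <- sum(grepl(quoted, long_meta, perl = TRUE))

c(plain = n_plain, quoted = n_quoted)
```

With word-only data such as the Ulysses sample this makes no difference, because \w+ tokens contain no metacharacters.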
Data:
library(gsubfn)
u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ")
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)