Efficient way to calculate number of strings which contain a defined shorter string

I have a character vector containing short strings:

short <- c("aaa", "bah", "dju", "kjs")

I want to count the number of strings in the following vector in which at least one of the above short strings is present.

long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs", "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

The number it should output for this is 4, as 4 of the strings in the long vector contain at least one of the shorter strings defined in the short vector.

My actual short vector is around 10000 strings and long is around 1000, so I am looking for an efficient way to calculate this.

Thanks!

This takes about 0.12 seconds on my laptop, where long and short are taken from the Note at the end and have lengths 10000 and 1000. No packages are used except to generate the sample data.

system.time(num <- length(grep(paste(short, collapse = "|"), long, perl = TRUE)))
   user  system elapsed 
   0.08    0.00    0.12 
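
As a minimal sketch, the same alternation pattern can be checked against the small vectors from the question. One caveat is that `paste(short, collapse = "|")` treats every short string as a regular expression. The `escape_regex` helper below is hypothetical (it is not part of the original answer) and assumes the short strings are meant to be matched literally.

```r
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# Hypothetical helper: backslash-escape regex metacharacters so that each
# short string is matched literally rather than as a pattern.
escape_regex <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# One alternation pattern: "aaa|bah|dju|kjs"
pattern <- paste(escape_regex(short), collapse = "|")

# grep() returns the indices of the long strings that match at least once
num <- length(grep(pattern, long, perl = TRUE))
num
# [1] 4
```

For purely alphanumeric short strings, as in the question, the escaping step changes nothing and can be dropped.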

In comparison, the Reduce/str_count solution takes 6.5 seconds.

Note: We take the first 1000 and 10000 words from the book Ulysses as the sample data.

library(gsubfn)

u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ") 
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)

We loop through the 'short' vector, get the str_count for each element, and Reduce the results to a single logical vector whose sum is the answer.

library(stringr)
sum(Reduce(`|`, lapply(short, str_count, string = long)))
#[1] 4

str_count uses the stringi functions, and this doesn't depend on the length of the vector.
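
The same loop-and-Reduce idea can be sketched in base R, assuming the short strings should be matched literally. This variant is not from the original answer: it swaps `str_count` for `grepl(..., fixed = TRUE)`, which returns a logical vector directly and does no regular-expression interpretation.

```r
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# For each short string, a logical vector marking which long strings contain it
hits <- lapply(short, grepl, x = long, fixed = TRUE)

# Element-wise OR across all short strings: TRUE where any short string occurs
contains_any <- Reduce(`|`, hits)

sum(contains_any)
# [1] 4
```

Like the str_count version, this makes one pass over `long` per short string, so it is likely to be slower than a single alternation pattern when `short` is large.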

This takes me only 0.09s with the data provided above.

system.time(sum(sapply(regmatches(long, gregexpr(paste(short, collapse = "|"), long, ignore.case = FALSE, perl = TRUE)), length) >= 1))
   user  system elapsed 
   0.09    0.00    0.09 

Data:

library(gsubfn)
u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ") 
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)
