I have a character vector containing short strings:
short <- c("aaa", "bah", "dju", "kjs")
I want to count the number of strings in the following vector in which at least one of the above short strings is present.
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs", "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")
The number it should output for this is 4, as 4 of the strings in the long vector contain at least one of the shorter strings defined in the short vector.
My actual short vector is around 10000 strings and long is around 1000, so I am looking for an efficient way to calculate this.
Thanks!
This takes about 0.12 seconds on my laptop, where long and short are taken from the Note at the end and have lengths 10000 and 1000 respectively. No packages are needed for the computation itself; gsubfn is used only to generate the sample data.
system.time(num <- length(grep(paste(short, collapse = "|"), long, perl = TRUE)))
user system elapsed
0.08 0.00 0.12
In comparison the Reduce/str_count solution takes 6.5 seconds.
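To make the idea concrete, here is a minimal sketch of the same alternation approach on the small vectors from the question. The variable name `pattern` is introduced here only for illustration, and `sum(grepl(...))` is used in place of `length(grep(...))`; both count the matching elements.

```r
# Toy data from the question
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# Join the short strings into one alternation pattern: "aaa|bah|dju|kjs"
pattern <- paste(short, collapse = "|")

# grepl() returns one TRUE/FALSE per element of long; sum() counts the TRUEs
num <- sum(grepl(pattern, long, perl = TRUE))
num
#[1] 4
```

Note that the short strings are interpreted as regular expressions here, so this sketch assumes they contain no regex metacharacters such as `.` or `(`.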
Note: We take the first 1000 and 10000 words from the book Ulysses as the sample data.
library(gsubfn)
u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ")
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)
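We loop through the 'short' vector, get the str_count of each short string in every element of 'long', and Reduce the resulting list with `|` to a single logical vector, whose sum is the number of matching strings.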
library(stringr)
sum(Reduce(`|`, lapply(short, str_count, string = long)))
#[1] 4
str_count uses the stringi functions, and these don't depend on the length of the vector.
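For comparison, the same loop-and-reduce idea can be sketched in base R without stringr. This is a variant, not the code from the answer above: it swaps `str_count` for `grepl(..., fixed = TRUE)`, and the names `hits` and `n_hits` are made up for illustration.

```r
# Toy data from the question
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# One logical vector per short string: does each element of long contain it?
# fixed = TRUE matches the short string literally rather than as a regex.
hits <- lapply(short, grepl, x = long, fixed = TRUE)

# Combine the logical vectors element-wise with OR, then count the TRUEs
n_hits <- sum(Reduce(`|`, hits))
n_hits
#[1] 4
```

Because of `fixed = TRUE`, this version should also behave sensibly when the short strings contain characters that are special in regular expressions.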
This takes me only 0.09s with the data provided above.
system.time(sum(sapply(regmatches(long, gregexpr(paste(short, collapse = "|"), long, ignore.case = FALSE, perl = TRUE)), length) >= 1))
user system elapsed
0.09 0.00 0.09
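As a rough illustration of what the expression above computes, here is a sketch on the small vectors from the question. It uses `lengths()` instead of `sapply(..., length)`, and the names `pattern`, `matches` and `counts` are assumptions introduced for this example.

```r
# Toy data from the question
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

pattern <- paste(short, collapse = "|")

# gregexpr() finds every match position in each element of long;
# regmatches() extracts the matched substrings as a list of character vectors
matches <- regmatches(long, gregexpr(pattern, long, perl = TRUE))

# Number of matches found in each element of long
counts <- lengths(matches)
counts
#[1] 1 1 1 2 0 0 0 0

# Count the elements of long with at least one match
sum(counts >= 1)
#[1] 4
```

This variant extracts every match, which is more work than a plain `grepl()` needs to do, but it also gives the per-string match counts if those are of interest.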