Efficient way to calculate number of strings which contain a defined shorter string

I have a character vector containing short strings:

short <- c("aaa", "bah", "dju", "kjs")

I want to count the number of strings in the following vector in which at least one of the above short strings is present.

long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs", "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

The number it should output for this is 4, as 4 of the strings in the long vector contain the shorter strings defined in the short vector.

My actual short vector is around 10000 strings and long is around 1000, so I am looking for an efficient way to calculate this.

Thanks!

This takes about 0.12 seconds on my laptop, where long and short come from the Note at the end and have lengths 10000 and 1000 respectively. No packages are needed for the computation itself; gsubfn is used only to generate the sample data.

system.time(num <- length(grep(paste(short, collapse = "|"), long, perl = TRUE)))
   user  system elapsed 
   0.08    0.00    0.12 

In comparison the Reduce/str_count solution takes 6.5 seconds.
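
As a quick check on the small vectors from the question, the same alternation approach can be sketched as below. One caveat: paste(short, collapse = "|") treats every short string as a regular expression, so strings containing metacharacters such as . or + would not be matched literally. The quote_literal and count_containing helpers are hypothetical names, not part of the original answer; the sketch assumes perl = TRUE and relies on PCRE's \Q...\E quoting.

```r
# Small vectors from the question.
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# Hypothetical helper: wrap each string in PCRE \Q...\E so it is matched
# literally. Assumes perl = TRUE and that no string contains "\E".
quote_literal <- function(x) paste0("\\Q", x, "\\E")

# Hypothetical helper: build one alternation pattern and count the elements
# of long that match it at least once.
count_containing <- function(short, long) {
  pattern <- paste(quote_literal(short), collapse = "|")
  sum(grepl(pattern, long, perl = TRUE))
}

num <- count_containing(short, long)
print(num)  # 4 of the 8 long strings contain at least one short string
```

If the short strings are known to contain only letters and digits, as in the Ulysses sample data, the quoting step is unnecessary and the one-liner above is enough.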

Note: The sample data are the first 1000 words (short) and the first 10000 words (long) of the book Ulysses.

library(gsubfn)

u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ") 
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)

We loop over the short vector, call str_count for each pattern against long, combine the resulting count vectors element-wise with Reduce and `|` into a single logical vector, and take its sum.

library(stringr)
sum(Reduce(`|`, lapply(short, str_count, string = long)))
#[1] 4

str_count is built on the stringi functions, so this does not depend on the length of the vector.
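
To make the intermediate steps visible, the same loop-and-Reduce idea can be sketched in base R. This is an illustrative variant, not the answer's code: it swaps str_count for grepl with fixed = TRUE, so each short string is matched literally and no package is needed. The names hits, any_hit and total are made up for this example.

```r
# Small vectors from the question.
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# One logical vector per short string: which elements of long contain it?
# fixed = TRUE matches the string literally instead of as a regex.
hits <- lapply(short, grepl, x = long, fixed = TRUE)

# Element-wise OR across all patterns, then count the TRUE entries.
any_hit <- Reduce(`|`, hits)
total <- sum(any_hit)
print(total)
```

This performs one pass over long per short string, so the running time grows with length(short); that is consistent with the slower timing reported for the Reduce approach.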

This takes only 0.09 seconds on my machine with the data provided above.

system.time(sum(sapply(regmatches(long, gregexpr(paste(short, collapse = "|"), long, ignore.case = FALSE, perl = TRUE)), length) >= 1))
   user  system elapsed 
   0.09    0.00    0.09 
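
The one-liner above extracts every match and then checks that each element of long has at least one. A minimal sketch on the question's small vectors, splitting it into steps and comparing the result with a plain grepl presence test, might look like this. The variable names are made up for this example, and no claim is made about which variant is faster.

```r
# Small vectors from the question.
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

pattern <- paste(short, collapse = "|")

# Extract all matches per element, then count how many each element has.
matches <- regmatches(long, gregexpr(pattern, long, perl = TRUE))
n_matches <- lengths(matches)

# An element counts if it has at least one match.
via_regmatches <- sum(n_matches >= 1)

# Presence test only: grepl returns TRUE or FALSE per element directly.
via_grepl <- sum(grepl(pattern, long, perl = TRUE))

print(c(via_regmatches, via_grepl))
```

Both counts agree, since only the presence of a match matters for this question, not the number of matches.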

