Efficient way to calculate number of strings which contain a defined shorter string

Question

I have a character vector containing short strings:

short <- c("aaa", "bah", "dju", "kjs")

I want to count the number of strings in the following vector in which at least one of the above short strings is present.

long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs", "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

The number it should output for this is 4, as 4 of the strings in the long vector contain the shorter strings defined in the short vector.

My actual short vector is around 10000 strings and long is around 1000, so I am looking for an efficient way to calculate this.

Thanks!

Answer 1

This takes about 0.12 seconds on my laptop, where long and short come from the Note at the end and have lengths 10000 and 1000 respectively. No packages are needed for the computation itself; gsubfn is used only to generate the sample data.

system.time(num <- length(grep(paste(short, collapse = "|"), long, perl = TRUE)))
   user  system elapsed 
   0.08    0.00    0.12

In comparison the Reduce/str_count solution takes 6.5 seconds.
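As a small illustration of the single-pattern approach, here is a sketch that applies it to the toy vectors from the question. The variable names pattern and matched are introduced here for readability and are not part of the original answer. It assumes the short strings contain no regex metacharacters (true for the toy data and for words extracted with "\\w+"); strings containing characters such as "." or "(" would need escaping first.

```r
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# Join the short strings into one alternation pattern: "aaa|bah|dju|kjs"
pattern <- paste(short, collapse = "|")

# grep() returns the indices of the elements of long that match the pattern
matched <- grep(pattern, long, perl = TRUE)

# The number of matching strings is the number of indices returned
num <- length(matched)
num
#[1] 4
```

Because each element of long is scanned once against a single combined pattern, there is no explicit loop over short.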

Note: The sample data are the first 1000 and the first 10000 words of the book Ulysses.

library(gsubfn)

u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ") 
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)

Answer 2

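We loop through the 'short' vector, apply str_count to 'long' for each short string, and Reduce the resulting count vectors with `|` into a single logical vector. Its sum is the number of strings that contain at least one short string.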

library(stringr)
sum(Reduce(`|`, lapply(short, str_count, string = long)))
#[1] 4

str_count uses the stringi functions, and this doesn't depend on the length of the vector.
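The loop-and-combine idea can also be sketched in base R, which makes the intermediate steps visible. Note that this swaps in grepl(fixed = TRUE) for str_count, so it is a variant of the answer rather than the answer's own code; the names hits and contains_any are hypothetical.

```r
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

# One logical vector per short string: which long strings contain it?
# fixed = TRUE treats each short string literally, not as a regex.
hits <- lapply(short, grepl, x = long, fixed = TRUE)

# Combine the per-pattern vectors element-wise with OR
contains_any <- Reduce(`|`, hits)

# Count the long strings that contain at least one short string
sum(contains_any)
#[1] 4
```

Literal matching avoids any need to escape special characters, at the cost of one pass over long per short string.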

Answer 3

This takes only 0.09 s on my machine with the data given below.

system.time(sum(sapply(regmatches(long, gregexpr(paste(short, collapse = "|"), long, ignore.case = FALSE, perl = TRUE)), length) >= 1))
   user  system elapsed 
   0.09    0.00    0.09
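To show what the one-liner computes, here is a sketch on the toy vectors from the question, broken into steps. The names pattern, m, and counts are introduced for illustration, and lengths() is used in place of sapply(..., length) as an equivalent base R shorthand.

```r
short <- c("aaa", "bah", "dju", "kjs")
long <- c("aaajhd", "slilduaaadifh", "sldifjsdbahsdofiusd", "sdflisjdjukjs",
          "sldifjbak", "sdfoiuwebss", "sdkfuhsd", "sdlfihwoio")

pattern <- paste(short, collapse = "|")

# gregexpr() locates every match of the pattern within each string;
# regmatches() extracts the matched substrings (character(0) if none)
m <- regmatches(long, gregexpr(pattern, long, perl = TRUE))

# Number of matches found in each long string
counts <- lengths(m)
counts
#[1] 1 1 1 2 0 0 0 0

# Strings with at least one match
sum(counts >= 1)
#[1] 4
```

The fourth string contains both "dju" and "kjs", which is why it has two matches but is still counted once.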

Data:

library(gsubfn)
u <- "http://www.gutenberg.org/files/4300/4300-0.txt"
joyce <- readLines(u)
joycec <- paste(joyce, collapse = " ") 
words <- strapplyc(joycec, "\\w+")[[1]]
short <- head(words, 1000)
long <- head(words, 10000)

3 answers

solution1
4 2017-12-10 19:46:56

solution2
1 ACCEPTED 2017-12-10 19:26:49

solution3
0 2017-12-10 19:43:02