Finding the number of specific character vector values in another character vector in R

I'm looking for a way to scan a character vector using another character vector. I've already spent many hours on this but can't get it right, and I can't find a function that does what I need. I'm sure there's an easy way to solve this.

So let's say I have the following vector:

    c <- c("bread", "milk", "oven", "salt")

I also have a vector containing sentences:

    text <- c("The BREAD is in the oven. Wonderful!!",
              "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
              "Oven, oven, oven, why not just eat it raw.")

Now I'd like to scan the text using the contents of my c vector. The output should look something like this:

                                            text bread milk oven salt
    1      The BREAD is in the oven. Wonderful!!     1    0    1    0
    2        We don't only need Milk... as well.     1    1    0    1
    3 Oven, oven, oven, why not just eat it raw.     0    0    3    0

I'd also like to search for combinations of words rather than just single words:

    c <- c("need milk", "oven oven", "eat it")

The output should have the same shape:

                                            text   need milk   oven oven   eat it
    1      The BREAD is in the oven. Wonderful!!           0           0        0
    2        We don't only need Milk... as well.           1           0        0
    3 Oven, oven, oven, why not just eat it raw.           0           2        1

It would be great if someone could help me! :) Thank you so much!

We can use str_count from stringr to count the occurrences of each pattern in each string. The text is lower-cased first so that the match ignores case.

    library(stringr)
    data.frame(text, sapply(c, str_count, string = tolower(text)))
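For the word combinations in the second part of the question, two things get in the way of a plain pattern count: punctuation between the words ("Oven, oven, oven") and overlapping matches ("oven oven" occurs twice in three consecutive "oven"s). The following is a minimal base R sketch of one way to handle both, not a definitive implementation. The names `phrases`, `clean` and `count_overlapping` are made up for illustration, and the sketch assumes the phrases contain no regex metacharacters.

```r
phrases <- c("need milk", "oven oven", "eat it")
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")

# Lower-case, replace punctuation with spaces, then collapse whitespace runs
clean <- tolower(text)
clean <- gsub("[[:punct:]]", " ", clean)
clean <- gsub("\\s+", " ", clean)

# Hypothetical helper: count matches of `phrase` in each element of `x`,
# using a zero-width lookahead so that overlapping matches are all counted.
# Assumes `phrase` contains no regex metacharacters.
count_overlapping <- function(phrase, x) {
  pattern <- paste0("(?=\\b", phrase, "\\b)")
  hits <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr() returns -1 for "no match", so count only positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

phrase_counts <- sapply(phrases, count_overlapping, x = clean)
print(data.frame(text, phrase_counts, check.names = FALSE))
```

Note that replacing punctuation with spaces also splits contractions such as "don't" into "don t", which may or may not matter for your phrases.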

Here is another solution using the stringi package. In terms of speed it beats the other approaches, though not in simplicity. Whether it "beats" them overall depends on how you weigh speed against simplicity and staying in base R.

Also note that the grepl solution does not return actual counts but binary indicators (0/1), as pointed out in the comment above, so it is not directly comparable. Depending on your needs, that may suffice.

    library(stringi)
    library(stringr)
    library(microbenchmark)

    c <- c("bread", "milk", "oven", "salt")
    text <- c("The BREAD is in the oven. Wonderful!!",
              "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
              "Oven, oven, oven, why not just eat it raw.")

    # stringi: case-insensitive fixed-string counts, one column per word
    stringi_approach <- function() {
      matches <- sapply(c, function(w) {stri_count_fixed(text, w, case_insensitive = TRUE)})
      rownames(matches) <- text
      matches  # return the matrix, not the value of the assignment
    }

    # base R: grepl() only detects a match, so this yields 0/1 rather than counts
    grepl_approach <- function() {
      data.frame(text, +(sapply(c, grepl, tolower(text))))
    }

    # stringr: pattern counts on the lower-cased text
    stringr_approach <- function() {
      data.frame(text, sapply(c, str_count, string = tolower(text)))
    }

    microbenchmark(
      grepl_approach(),
      stringr_approach(),
      stringi_approach()
    )

    # Unit: microseconds
    #                expr     min      lq     mean   median       uq     max neval
    #    grepl_approach() 309.091 338.500 351.3017 347.5790 352.7105 565.679   100
    #  stringr_approach() 380.541 418.634 437.7599 429.2925 441.7275 814.767   100
    #  stringi_approach() 101.057 113.492 126.9763 129.4790 133.8215 217.903   100
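To make the difference between detection and counting concrete, here is a small base R sketch that needs no extra packages. It is only an illustration under stated assumptions: `words` stands in for the question's `c` vector, and `count_fixed` is a hypothetical helper name, not a function from any package.

```r
words <- c("bread", "milk", "oven", "salt")  # called `c` in the question
text <- c("The BREAD is in the oven. Wonderful!!",
          "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
          "Oven, oven, oven, why not just eat it raw.")

# Hypothetical helper: number of fixed-string matches of `pattern`
# in each element of `x`
count_fixed <- function(pattern, x) {
  hits <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr() returns -1 for "no match", so count only positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Actual counts per sentence and word
counts <- sapply(words, count_fixed, x = tolower(text))

# Binary indicators: grepl() only says whether a word occurs at all
detected <- +(sapply(words, grepl, tolower(text), fixed = TRUE))

print(counts)
print(detected)
```

The two matrices differ only where a word occurs more than once in a sentence, here in the third sentence for "oven".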

You can use the corpus library for this:

    library(corpus)
    library(Matrix)

    text <- c("The BREAD is in the oven. Wonderful!!",
              "We don't only need Milk to bake a yummy bread, but a pinch of salt as well.",
              "Oven, oven, oven, why not just eat it raw.")

    term_matrix(text, select = c("bread", "milk", "oven", "salt"))
    ## 3 x 4 sparse Matrix of class "dgCMatrix"
    ##      bread milk oven salt
    ## [1,]     1    .    1    .
    ## [2,]     1    1    .    1
    ## [3,]     .    .    3    .

    term_matrix(text, select = c("need milk", "oven oven", "eat it"), drop_punct = TRUE)
    ## 3 x 3 sparse Matrix of class "dgCMatrix"
    ##      need milk oven oven eat it
    ## [1,]         .         .      .
    ## [2,]         1         .      .
    ## [3,]         .         2      1

Alternatively, you can modify one of Manuel Bickel's answers, using text_count from corpus instead of str_count.
