Is there a way in R to find values in a column that contain a word? For example, I want to count all the values that contain the word "the", where some values of the column are "the_cat", "the_dog", and "dog".
x <- c("the_dog", "the_cat", "dog")
Using the example above, the answer would be 2. I know this is relatively easy to do in Python, but I am wondering if there is a way to do this in R. Thanks!
Try:
sum(grepl("(?<![A-Za-z])the(?![A-Za-z])", x, perl = TRUE))
This gives a sum of 2 on your example.
But let's consider also a slightly more complex example:
x <- c("the_dog", "the_cat", "dog", "theano", "menthe", " the")
Running the same call on this vector gives:
[1] 3
Above, we match any "the" that does not have another letter directly before or after it (so "theano" and "menthe" are not counted).
You could also add, inside the [], other characters you don't want to allow next to the match. For example, if you wouldn't consider "the99" to be the word "the", you would use [A-Za-z0-9], and so on.
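To make the effect of the character class concrete, here is a minimal base R sketch that runs both patterns on the extended vector. The extra element "the99" and the variable names letters_only and letters_digits are assumptions added for illustration; they are not part of the original example.

```r
# Extended example vector; "the99" is added here for illustration only
x <- c("the_dog", "the_cat", "dog", "theano", "menthe", " the", "the99")

# Exclude letters only: "the99" still counts, because "9" is not a letter
letters_only <- grepl("(?<![A-Za-z])the(?![A-Za-z])", x, perl = TRUE)
sum(letters_only)
#> [1] 4

# Exclude letters and digits: "the99" is no longer counted
letters_digits <- grepl("(?<![A-Za-z0-9])the(?![A-Za-z0-9])", x, perl = TRUE)
sum(letters_digits)
#> [1] 3

# The only element on which the two patterns disagree
x[letters_only & !letters_digits]
#> [1] "the99"
```

The underscore is in neither class, which is why "the_dog" and "the_cat" match under both patterns.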
You can also use the same approach with stringr. For example (I've included the exclusion of digits, so with the pattern below "the99" wouldn't be counted as a word):
library(stringr)
sum(str_detect(x, "(?<![A-Za-z0-9])the(?![A-Za-z0-9])"))
library(stringr)
library(dplyr)  # provides tibble(), filter() and %>%
## With a vector
sum(str_detect(c("the_dog", "the_cat", "dog"), "the"))
## In a data frame
tibble(x = c("the_dog", "the_cat", "dog")) %>%
  filter(str_detect(x, "the")) %>%
  nrow()
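For comparison, the same data frame count can be sketched in base R without loading any packages. The names df and hits are hypothetical, and fixed = TRUE is an assumption that treats "the" as a literal substring rather than a regular expression.

```r
# Hypothetical data frame with the example values in column x
df <- data.frame(x = c("the_dog", "the_cat", "dog"), stringsAsFactors = FALSE)

# TRUE for every row whose value contains the literal substring "the"
hits <- grepl("the", df$x, fixed = TRUE)

sum(hits)    # number of matching rows
#> [1] 2
which(hits)  # positions of the matching rows
#> [1] 1 2

# Keep only the matching rows (drop = FALSE keeps the result a data frame)
matching_rows <- df[hits, , drop = FALSE]
```

Like the stringr version, this is a plain substring match, so a value such as "theano" would also be counted.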
x <- c("the_dog", "the_cat", "dog")
stringr::str_detect(x, "the")
#> [1] TRUE TRUE FALSE
Created on 2019-02-23 by the reprex package (v0.2.1)
Try also:
x <- c("the_dog", "the_cat", "dog")
sum(stringi::stri_count(x, regex = "^the"))  # matches "the" at the beginning of the string
Result:
[1] 2
Or:
x <- c("the_dog", "the_cat", "dog")
# matches "the" anywhere in the string; in "the{1,}" the quantifier applied only to the final "e"
sum(stringi::stri_count(x, regex = "the"))
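One thing to keep in mind with a counting approach: counting matches is not the same as counting values, because a single value can contain the pattern more than once. Here is a small base R sketch of the difference. The vector y and the name occurrences are made up for illustration, and gregexpr() stands in for the stringi call.

```r
# Hypothetical vector: "the_other" contains "the" twice
y <- c("the_other", "the_cat", "dog")

# Number of occurrences of "the" in each element
occurrences <- lengths(regmatches(y, gregexpr("the", y, fixed = TRUE)))
occurrences
#> [1] 2 1 0

sum(occurrences)                    # total number of matches
#> [1] 3
sum(grepl("the", y, fixed = TRUE))  # number of values containing a match
#> [1] 2
```

On the original three-element example both sums are 2, so the distinction only matters once a value repeats the pattern.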