Exact matching of text against a dataframe column in R

I have a vector of words in R:

words = c("Awesome","Loss","Good","Bad")

And I have the following dataframe in R:

df <- data.frame(ID = c(1,2,3),
                 Response = c("Today is an awesome day", 
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"))

I want to extract the words that match exactly in the Response column and insert them into a new column of the dataframe. The final output should look like this:

ID   Response                                       Match
1    Today is an awesome day                        Awesome
2    Yesterday was a bad day,but today it is good   Bad,Good
3    I have losses today                            NA

I used the following code:

# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))

# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

This finds matches, but not exact ones: "Loss" also matches "losses". Please help.

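You can add anchors to the entries of your words vector: ^ asserts the start of the string and $ the end (of the whole string, not of a single word). So "^Loss$" matches only a response that consists of exactly that word, and no longer matches "losses". So: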

words = c("Awesome","^Loss$","Good","Bad")

Then use your code:

x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

which gives:

> df
  ID                                     Response    Words
1  1                      Today is an awesome day  Awesome
2  2 Yesterday was a bad day,but today it is good Good,Bad
3  3                          I have losses today  

To turn blanks into NA:

df$Words[df$Words == ""] <- NA
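To make the difference between string anchors and word boundaries concrete, here is a minimal sketch. The vector `s` is made up for illustration and is not part of the question's data.

```r
# Illustrative strings (an assumption, not taken from the question)
s <- c("loss", "a loss today", "I have losses today")

# ^ and $ anchor the whole string, so only the element that is exactly "loss" matches
anchored <- grepl("^loss$", s)

# \\b marks a word boundary, so "loss" matches as a whole word anywhere in the string
bounded <- grepl("\\bloss\\b", s)

print(anchored)  # TRUE FALSE FALSE
print(bounded)   # TRUE  TRUE FALSE
```

Neither pattern matches "losses", but only the word-boundary version finds "loss" inside a longer sentence.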

We can use str_extract_all from stringr, with a case-insensitive pattern ((?i)) wrapped in word boundaries (\\b):

library(stringr)
library(dplyr)
library(purrr)
df %>%
    mutate(Words = map_chr(str_extract_all(Response,
        str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")), toString))
#   ID                                     Response     Words
#1  1                      Today is an awesome day   awesome
#2  2 Yesterday was a bad day,but today it is good bad, good
#3  3                          I have losses today          

data

words <- c("Awesome","Loss","Good","Bad")
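The extracted matches keep the spelling found in the text ("awesome", "bad, good"), while the question asks for the spelling from `words` and NA for no match. A possible base R sketch of the same idea follows. The names `pattern` and `hits` are illustrative, and the `Match` column name is taken from the desired output in the question.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
df <- data.frame(ID = c(1, 2, 3),
                 Response = c("Today is an awesome day",
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"),
                 stringsAsFactors = FALSE)

# One pattern: any of the words, delimited by word boundaries
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# All case-insensitive matches per response, in order of appearance
hits <- regmatches(df$Response,
                   gregexpr(pattern, df$Response, ignore.case = TRUE, perl = TRUE))

# Map each hit back to its spelling in `words`; use NA when nothing matched
df$Match <- vapply(hits, function(h) {
  if (length(h) == 0) return(NA_character_)
  paste(unique(words[match(tolower(h), tolower(words))]), collapse = ",")
}, character(1))

print(df$Match)
```

Under these assumptions the matches come out in the order they appear in each response, so the second row reads "Bad,Good" rather than "Good,Bad".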

Change the function in the first *apply call to a two-line function. If the regex becomes "\\bword\\b", it only matches the word when it is surrounded by word boundaries.

x <- sapply(words, function(x) {
  y <- paste0("\\b", x, "\\b")
  grepl(tolower(y), tolower(df$Response))
})

Now run the second apply as posted in the question.

df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

df
#  ID                                     Response    Words
#1  1                      Today is an awesome day  Awesome
#2  2 Yesterday was a bad day,but today it is good Good,Bad
#3  3                          I have losses today       

As for the NAs, I will use the replacement function is.na<-.

is.na(df$Words) <- df$Words == ""
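For readers unfamiliar with the replacement form, this small sketch suggests it has the same effect as assigning NA through logical indexing. The vectors `a` and `b` are made-up examples.

```r
# Two copies of an illustrative character vector with one empty entry
a <- c("Awesome", "", "Good,Bad")
b <- a

is.na(a) <- a == ""   # replacement-function form
b[b == ""] <- NA      # logical-indexing form

print(identical(a, b))  # TRUE
```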

Data.

df <- read.table(text = "
ID           Response
1            'Today is an awesome day'
2            'Yesterday was a bad day,but today it is good'
3            'I have losses today'
", header = TRUE)

words <- c("Awesome","Loss","Good","Bad")
