
Matching text with a dataframe column in R

I have a vector of words in R.

words = c("Awesome","Loss","Good","Bad")

And I have the following dataframe in R:

ID           Response
1            Today is an awesome day
2            Yesterday was a bad day,but today it is good
3            I have losses today
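
For reproducibility, the data above could be rebuilt roughly as follows. This is only a sketch: the column types (integer ID, character Response) are an assumption, since the question shows the printed dataframe but not the code that created it.

```r
# The word list from the question
words <- c("Awesome", "Loss", "Good", "Bad")

# Assumed construction of the example dataframe
df <- data.frame(
  ID = 1:3,
  Response = c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today"),
  stringsAsFactors = FALSE  # keep Response as character, not factor
)
```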

I want to extract the words that match in the Response column and insert them into a new column of the dataframe, along with a count of the matches. The final output should look like this:

ID           Response                                        Match       Count
1            Today is an awesome day                         Awesome     1
2            Yesterday was a bad day,but today it is good    Bad,Good    2
3            I have losses today                             Loss        1

I tried the following in R:

sapply(words,grepl,df$Response)

It matches the words, but how do I get my dataframe into the desired format? Please help.

Using base R (credit to PereG as well, for help with a more concise way to compute df$Count):

# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))

# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

# count the number of matching words
df$Count <- apply(x, 1, function(i) sum(i))

# df
#  ID                                     Response    Words Count
#1  1                      Today is an awesome day  Awesome     1
#2  2 Yesterday was a bad day,but today it is good Good,Bad     2
#3  3                          I have losses today     Loss     1

Here's another option, which stores the matches in a list column:

vgrepl <- Vectorize(grepl, "pattern")
df$Match <- lapply(df$Response, function(x) 
  words[vgrepl(words, x, ignore.case = TRUE)]
)
df$Count <- lengths(df$Match)
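
Because Match is a list column here, it does not print as the comma-separated string shown in the question. A minimal sketch of one way to flatten it is given below. The example data is rebuilt so the block runs on its own, and the column name MatchString is hypothetical, chosen only for illustration.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
df <- data.frame(
  ID = 1:3,
  Response = c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today"),
  stringsAsFactors = FALSE
)

# Same approach as above: one character vector of matched words per row
vgrepl <- Vectorize(grepl, "pattern")
df$Match <- lapply(df$Response, function(x)
  words[vgrepl(words, x, ignore.case = TRUE)]
)
df$Count <- lengths(df$Match)

# Collapse each list element into a single comma-separated string;
# rows without any match become ""
df$MatchString <- vapply(df$Match, paste, character(1), collapse = ",")
```

Keeping the list column and adding the string next to it leaves the individual words available for later filtering or counting.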

With df as the dataframe, the following also works using stringr:

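library(stringr)

# one column per word: the lowercase match, or "" where the word is absent
# (assumes each word occurs at most once per response)
matches <- sapply(seq_along(words), function(i) str_extract_all(tolower(df$Response),
                                                     tolower(words[i]), simplify = TRUE))
# join only the non-empty matches, so adjacent words are never glued together
df$Match <- apply(matches, 1, function(x) paste(x[x != ''], collapse = ','))
df$Count <- apply(matches, 1, function(x) sum(x != ''))
head(df)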

#  ID                                     Response    Match Count
#1  1                      Today is an awesome day  awesome     1
#2  2 Yesterday was a bad day,but today it is good good,bad     2
#3  3                          I have losses today     loss     1

A suggestion using the tidyverse. It reports the actual matched text rather than the patterns (which are matched case-insensitively), but it should be sufficient for illustration purposes.

library(stringr)
library(dplyr)
library(purrr)

words <- c("Awesome", "Loss", "Good", "Bad")
"ID;Response
1;Today is an awesome day
2;Yesterday was a bad day,but today it is good
3;I have losses today" %>%
  textConnection %>%
  read.table(header = TRUE, 
             sep = ";",
             stringsAsFactors = FALSE) ->
  d

d %>%
  mutate(matches = str_extract_all(
                     Response,
                     str_c(words, collapse = "|") %>% regex(ignore_case = TRUE)),
         Match = map_chr(matches, str_c, collapse = ","),
         Count = map_int(matches, length))
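
If the spelling from `words` is wanted instead of the matched text (for example "Awesome" rather than "awesome"), the extracted strings can be mapped back with a case-insensitive lookup. The sketch below is written in base R so it runs without extra packages. `to_canonical` is a hypothetical helper name, and the `matches` list is hard-coded to the values `str_extract_all()` would be expected to return for the example data.

```r
words <- c("Awesome", "Loss", "Good", "Bad")

# Hypothetical helper: map extracted text back to the spelling used in
# `vocab`, dropping duplicates when a word is found more than once
to_canonical <- function(found, vocab = words) {
  unique(vocab[match(tolower(found), tolower(vocab))])
}

# Assumed extraction result for the three example responses
matches <- list("awesome", c("bad", "good"), "loss")

canonical <- lapply(matches, to_canonical)
Match <- vapply(canonical, paste, character(1), collapse = ",")
Count <- lengths(canonical)
```

Inside the `mutate()` call above, the same helper could be applied to the `matches` column with `map()` before collapsing with `str_c()`.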

