Finding which words from a list are present in a data frame column using grepl in R

I have a dataframe df:

df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"), 
class = "factor")), .Names = c("page","text"), row.names = c(NA, -4L), class = "data.frame")

I also have a list of words:

wordlist <- c("Audi", "BMW", "extended", "engine", "replacement", "Volkswagen", "company", "Toyota","exchange", "brand")

I checked which words from wordlist are present in the text column using lapply and grepl, then unlisting the result:

library(data.table)
setDT(df)[, match := paste(wordlist[unlist(lapply(wordlist, function(x) grepl(x, text, ignore.case = TRUE)))], collapse = ","), by = 1:nrow(df)]

The problem is that I want to find only exact (whole-word) matches of the words in wordlist in the text column. grepl also reports partial matches: for example, AudiA6 in text is matched by the word Audi from wordlist. My data frame is also very large, and running grepl this way takes a long time. Please recommend another approach if possible. I want something like this:

df <- structure(list(page = c(12, 6, 9, 65), 
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
 "Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor"), match = c("exchange", "BMW,engine,replacement", 
"brand", "BMW,Volkswagen,company")), row.names = c(NA, -4L), 
class = c("data.table", "data.frame"))

You can use str_extract_all from stringr after adding word boundaries ("\\b") around each word you want to extract, so that only whole-word matches are considered. You also need to collapse all the words with "|" to express "or":

sapply(stringr::str_extract_all(df$text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")
# [1] "exchange"               "engine,replacement,BMW" "brand"                  "Volkswagen,company,BMW"
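
If you would rather not depend on stringr, the same word-boundary pattern should also work with base R's gregexpr and regmatches. This is only a minimal sketch: the names texts, pattern and base_matches are illustrative, and the text is written as a plain character vector rather than the factor column from the question.

```r
# Illustrative data: the same sentences and word list as in the question
texts <- c("ToyotaCorolla is offering new car exchange offers",
           "Get 2 years engine replacement warranty on BMW X6",
           "I just bought a brand new AudiA6",
           "Volkswagen is the parent company of BMW")
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# Without boundaries "Audi" matches inside "AudiA6"; with \\b it does not
partial_hit <- grepl("Audi", texts[3])
exact_hit   <- grepl("\\bAudi\\b", texts[3], perl = TRUE)

# One alternation pattern with a word boundary on both sides of each word
pattern <- paste0("\\b", wordlist, "\\b", collapse = "|")

# Extract every whole-word match per element, then join them with commas
found <- regmatches(texts, gregexpr(pattern, texts, perl = TRUE))
base_matches <- vapply(found, paste, character(1), collapse = ",")
base_matches
```

As with str_extract_all, the matches come back in the order they appear in the text, not in the order of wordlist. Passing ignore.case = TRUE to gregexpr would restore the case-insensitive behaviour of the original grepl call.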

If you want to store the result in your data.table:

df[, match:=sapply(stringr::str_extract_all(text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")]
df
#   page                                              text                  match
#1:   12 ToyotaCorolla is offering new car exchange offers               exchange
#2:    6 Get 2 years engine replacement warranty on BMW X6 engine,replacement,BMW
#3:    9                  I just bought a brand new AudiA6                  brand
#4:   65           Volkswagen is the parent company of BMW Volkswagen,company,BMW
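
Since the question also mentions that the data frame is very large, one alternative worth benchmarking is to avoid regex alternation altogether: split each text into tokens and keep the tokens that appear in wordlist. This is a sketch under assumptions rather than a measured improvement. The names texts, tokens and token_matches are illustrative, a "word" is assumed to be a run of letters, digits or underscores, and matching is case-sensitive.

```r
# Illustrative data: the same sentences and word list as in the question
texts <- c("ToyotaCorolla is offering new car exchange offers",
           "Get 2 years engine replacement warranty on BMW X6",
           "I just bought a brand new AudiA6",
           "Volkswagen is the parent company of BMW")
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# Split each sentence on runs of non-word characters
# (as.character() guards against a factor column)
tokens <- strsplit(as.character(texts), "[^[:alnum:]_]+")

# Keep only tokens that are exactly equal to an entry of wordlist;
# intersect() preserves text order and drops repeated words
token_matches <- vapply(
  tokens,
  function(tok) paste(intersect(tok, wordlist), collapse = ","),
  character(1)
)
token_matches
```

Unlike str_extract_all, intersect reports each word at most once per row, so a word that appears twice in the same text is listed only once. In a data.table this could be assigned with df[, match := ...] in the same way as above.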
