
Finding which words from a list are present in a data frame column using grepl in R

I have a data frame df:

df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"), 
class = "factor")), .Names = c("page","text"), row.names = c(NA, -4L), class = "data.frame")

I also have a list of words:

wordlist <- c("Audi", "BMW", "extended", "engine", "replacement", "Volkswagen", "company", "Toyota","exchange", "brand")

I checked whether the words from wordlist are present in the text column by running grepl for each word and unlisting the result:

library(data.table)
setDT(df)[, match := paste(wordlist[unlist(lapply(wordlist, function(x) grepl(x, text, ignore.case = T)))], collapse = ","), by = 1:nrow(df)]

The problem is that I want to find only exact matches of the wordlist words in the text column. grepl also returns partial matches: for example, AudiA6 in the text is matched by the word Audi from wordlist. My data frame is also very large, and the grepl approach takes a long time to run. If possible, please recommend another approach. I want something like this:

df <- structure(list(page = c(12, 6, 9, 65), 
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
 "Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor"), match = c("exchange", "BMW,engine,replacement", 
"brand", "BMW,Volkswagen,company")), row.names = c(NA, -4L), 
class = c("data.table", "data.frame"))

You can use str_extract_all from stringr after wrapping each word you want to extract in word boundaries ( \\b ), so that only whole-word matches are considered. You also need to collapse all the words with "|" to express "or":

sapply(stringr::str_extract_all(df$text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")
# [1] "exchange"               "engine,replacement,BMW" "brand"                  "Volkswagen,company,BMW"
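
To see why the word boundaries rule out AudiA6, here is a minimal sketch using a shortened two-word list and two made-up sentences (both are illustrative assumptions, not the asker's data). It builds the same kind of pattern and tests it with base grepl:

```r
wordlist <- c("Audi", "BMW")

# Wrap every word in \b...\b and join the alternatives with "|".
pattern <- paste0("\\b", wordlist, "\\b", collapse = "|")
print(pattern)

# In "AudiA6" the letters "Audi" are followed by another word character,
# so there is no boundary there and the first string does not match.
hits <- grepl(pattern, c("I bought an AudiA6", "I bought an Audi A6"), perl = TRUE)
print(hits)  # FALSE TRUE
```

Note that the matches come back in the order they appear in the text, which is why the output above reads "engine,replacement,BMW" rather than "BMW,engine,replacement".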

If you want to put it in your data.table:

df[, match:=sapply(stringr::str_extract_all(text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")]
df
#   page                                              text                  match
#1:   12 ToyotaCorolla is offering new car exchange offers               exchange
#2:    6 Get 2 years engine replacement warranty on BMW X6 engine,replacement,BMW
#3:    9                  I just bought a brand new AudiA6                  brand
#4:   65           Volkswagen is the parent company of BMW Volkswagen,company,BMW
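
If the matches should be listed in wordlist order, as in the desired output in the question, one possible base R alternative is to split each text into whole words and look the wordlist up in those tokens. This is only a sketch under a few assumptions: the text column is stored as character rather than factor, words are separated by non-alphanumeric characters, and matching is case-sensitive (unlike the ignore.case = T in the question).

```r
# Same rows as in the question, with text kept as character.
df <- data.frame(
  page = c(12, 6, 9, 65),
  text = c("ToyotaCorolla is offering new car exchange offers",
           "Get 2 years engine replacement warranty on BMW X6",
           "I just bought a brand new AudiA6",
           "Volkswagen is the parent company of BMW"),
  stringsAsFactors = FALSE
)
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# Split each text into whole words on runs of non-alphanumeric characters.
tokens <- strsplit(df$text, "[^[:alnum:]]+")

# Keep the wordlist entries that occur as whole tokens, in wordlist order.
df$match <- vapply(
  tokens,
  function(tok) paste(wordlist[wordlist %in% tok], collapse = ","),
  character(1)
)
print(df$match)
```

Because each row is tokenized once and compared with %in% rather than being scanned once per word, this may also scale better on a large data frame, though that would need to be measured on the real data.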

