
R: match whole words in phrases

I have a character vector

var1 <- c("pine tree", "forest", "fruits", "water")

and a list

var2 <- list(c("tree", "house", "star"),  c("house", "tree", "pine tree", "tree pine", "dense forest"), c("apple", "orange", "grapes"))

I want to match the words in var1 against the words in var2 and extract the element of var2 with the most matches. For example,

[[1]]
[1] "tree"  "house" "star" 

has 1 match with var1

[[2]]
[1] "house"        "tree"         "pine tree"    "tree pine"    "dense forest"

has 4 matches with var1

[[3]]
[1] "apple"  "orange" "grapes"

has 0 matches with var1

And the desired output is the following:

[[2]]
[1] "house"        "tree"         "pine tree"    "tree pine"    "dense forest"

I tried

sapply(var1, grep,  var2, ignore.case=T, value=T)

without getting the desired output.

How can I solve this? A code snippet would be appreciated. Thanks.

We create a pattern string ('pat') for grepl by first splitting 'var1' on whitespace ('\\s+'); the result is a list. We loop over that list with sapply, joining each element's words (plus the original phrase) with paste and collapse='|', and then collapse the resulting vector into a single string with another paste. The | acts as OR when the string is used as the pattern for grepl. For each element of 'var2', summing the logical vector returned by grepl counts its matches; these counts form 'v1', and which.max(v1) is then used to subset the list 'var2' according to the condition described in the question.

 pat <- paste(sapply(strsplit(var1, '\\s+'), function(x)
     paste(unique(c(x, paste(x, collapse=' '))), collapse='|')),
     collapse='|')
 v1 <- sapply(var2, function(x) sum(grepl(pat, x)))
 v1
 #[1] 1 4 0
 var2[which.max(v1)]
 #[[1]]
 #[1] "house"        "tree"         "pine tree"    "tree pine"    "dense forest"
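
The pattern above is unanchored, so a word such as "street" would also count as a match for "tree". If strict whole-word matching is wanted, as the title suggests, one possible variant wraps the alternation in word boundaries. This is only a sketch under that assumption; the names words, pat_whole, counts and best are illustrative and do not come from the original answer. It also keeps every element that reaches the maximum count, whereas which.max returns only the first one.

```r
# Data from the question
var1 <- c("pine tree", "forest", "fruits", "water")
var2 <- list(c("tree", "house", "star"),
             c("house", "tree", "pine tree", "tree pine", "dense forest"),
             c("apple", "orange", "grapes"))

# Individual words of var1 (illustrative name); the full phrases add nothing
# once each of their words is already an alternative in the pattern
words <- unique(unlist(strsplit(var1, "\\s+")))

# Assumption: "whole word" means delimited by regex word boundaries (\\b)
pat_whole <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Count how many strings in each list element contain at least one whole word
counts <- sapply(var2, function(x) sum(grepl(pat_whole, x, perl = TRUE)))

# Keep all elements tied for the maximum count
# (if nothing matches anywhere, this keeps every element, since all counts are 0)
best <- var2[counts == max(counts)]
```

With the data from the question, counts has the same values as v1, and best contains the second element of var2.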

