R match whole words in phrases
I have a character vector
var1 <- c("pine tree", "forest", "fruits", "water")
and a list
var2 <- list(c("tree", "house", "star"), c("house", "tree", "pine tree", "tree pine", "dense forest"), c("apple", "orange", "grapes"))
I want to match the words in var1 against the words in var2, and extract the element of var2 with the largest number of matches. For example,
[[1]]
[1] "tree" "house" "star"
has 1 match with var1
[[2]]
[1] "house" "tree" "pine tree" "tree pine" "dense forest"
has 4 matches with var1
[[3]]
[1] "apple" "orange" "grapes"
has 0 matches with var1
And the desired output is the following:
[[2]]
[1] "house" "tree" "pine tree" "tree pine" "dense forest"
I tried
sapply(var1, grep, var2, ignore.case=T, value=T)
without getting the desired output.
How can I solve this? A code snippet would be appreciated. Thanks.
We create a pattern string ('pat') for grepl by first splitting 'var1' on whitespace ('\\s+'). The output of the split is a list. We use sapply to loop over the list, paste each element's words together with collapse='|', and then collapse the whole vector into a single string with another paste. The | acts as OR when the string is used as the pattern for grepl in v1. The vector of sums ('v1') is then used to subset the list 'var2' based on the condition described in the question.
pat <- paste(sapply(strsplit(var1, '\\s+'), function(x)
paste(unique(c(x, paste(x, collapse=' '))), collapse='|')),
collapse='|')
v1 <- sapply(var2, function(x) sum(grepl(pat, x)))
v1
#[1] 1 4 0
var2[which.max(v1)]
#[[1]]
#[1] "house" "tree" "pine tree" "tree pine" "dense forest"
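Since the title asks about whole words, note that the pattern above also matches substrings (for example, "pine" inside "pineapple"). A minimal sketch of one possible variation follows, assuming that whole-word matching is wanted and that every element tied for the maximum should be returned. The names words, pat_whole, counts and best are made up for this illustration and are not part of the answer above. The full phrase "pine tree" is left out of the pattern here because its individual words already cover it.

```r
var1 <- c("pine tree", "forest", "fruits", "water")
var2 <- list(c("tree", "house", "star"),
             c("house", "tree", "pine tree", "tree pine", "dense forest"),
             c("apple", "orange", "grapes"))

# Individual words from var1 (phrases are split on whitespace)
words <- unique(unlist(strsplit(var1, "\\s+")))

# Wrap the alternation in \\b so that only whole words match
pat_whole <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Number of matching strings in each list element
counts <- sapply(var2, function(x) sum(grepl(pat_whole, x, ignore.case = TRUE)))

# Keep every list element that ties for the maximum count
# (which.max would return only the first one)
best <- var2[counts == max(counts)]
```

For the data in the question this gives the same counts as v1 (1, 4 and 0), so best holds only the second element of var2. The difference shows up on inputs such as "pineapple", which pat_whole does not match.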