Partial string match between two columns in R
I have been trying to partially match the contents of two columns, based on a list of regular expressions common to both columns:
dats <- data.frame(ID = c(1:3), species = c("dog", "cat", "rabbit"),
                   species.descriptor = c("all animal dog", "all animal cat", "rabbit exotic"),
                   product = c(1, 2, 3),
                   product.authorise = c("all animal dog cat rabbit", "cat horse pig", "dog cat"))
with the aim of achieving this:
goal <- data.frame(ID = c(1:3), species = c("dog", "cat", "rabbit"),
                   species.descriptor = c("all animal dog", "all animal cat", "rabbit exotic"),
                   product = c(1, 2, 3),
                   product.authorise = c("all animal dog cat rabbit", "cat horse pig", "dog cat"),
                   authorised = c("TRUE", "TRUE", "FALSE"))
To explain further: if 'dog' appears at any point in both columns, then the row should be marked 'TRUE' in the new column, and the same applies to any individual species descriptor. If no match is found, then returning either FALSE or NA would be fine.
So far I have gotten to this point:
library(stringr)
patts<-c("dog","cat","all animal")
reg.patts<-paste(patts,collapse="|")
dats$matched<-ifelse((str_extract(dats$species.descriptor,reg.patts) == str_extract(dats$product.authorise,reg.patts)),"TRUE","FALSE")
dats
  ID species species.descriptor product         product.authorise matched
1  1     dog     all animal dog       1 all animal dog cat rabbit    TRUE
2  2     cat     all animal cat       2             cat horse pig   FALSE
3  3  rabbit      rabbit exotic       3                   dog cat    <NA>
As you can see, this correctly identifies the first and last rows: 'all animal' appears first in both strings in the first row, and there is no match at all in the last. However, it struggles (as in the second row) when the regular expression does not appear first in the string. I have tried str_extract_all, but have only gotten error messages so far. Can anyone help, please?
Here is a solution using dplyr for piping. The core component is grepl, which does logical string matching of species in both species.descriptor and product.authorise.
library(dplyr)
dats %>%
  rowwise() %>%
  mutate(authorised =
           grepl(species, species.descriptor) &
           grepl(species, product.authorise)
  )
Source: local data frame [3 x 6]
Groups: <by row>

     ID species species.descriptor product         product.authorise authorised
  (int)  (fctr)             (fctr)   (dbl)                    (fctr)      (lgl)
1     1     dog     all animal dog       1 all animal dog cat rabbit       TRUE
2     2     cat     all animal cat       2             cat horse pig       TRUE
3     3  rabbit      rabbit exotic       3                   dog cat      FALSE
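The rowwise() step matters because grepl uses only the first element of its pattern argument when given a vector. As a minimal sketch of the same row-by-row check in base R, assuming the dats data frame from the question with character columns (the helper name has_species is hypothetical, and fixed = TRUE is an added assumption that species names should be matched literally):

```r
# Rebuild the example data from the question (character columns).
dats <- data.frame(
  ID = 1:3,
  species = c("dog", "cat", "rabbit"),
  species.descriptor = c("all animal dog", "all animal cat", "rabbit exotic"),
  product = c(1, 2, 3),
  product.authorise = c("all animal dog cat rabbit", "cat horse pig", "dog cat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair each pattern with the text on the same row.
# fixed = TRUE treats the species name as a literal string, not a regex.
has_species <- function(pattern, text) {
  unname(mapply(grepl, pattern, text, MoreArgs = list(fixed = TRUE)))
}

# TRUE only when the species appears in both columns of that row.
dats$authorised <- has_species(dats$species, dats$species.descriptor) &
  has_species(dats$species, dats$product.authorise)
```

This should give the same authorised values as the dplyr version, without needing rowwise().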
If you really like stringr, you can use the str_detect function for more user-friendly syntax. It is vectorised over both the string and the pattern, so rowwise() is not needed.
library(stringr)
dats %>%
  mutate(authorised =
           str_detect(species.descriptor, species) &
           str_detect(product.authorise, species)
  )
And if you don't like dplyr, you can add the column directly:
dats$authorised <-
  with(dats,
       str_detect(species.descriptor, species) &
       str_detect(product.authorise, species)
  )
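The approaches above match on the species column. If the intent is instead the question's original idea, that any pattern from patts occurs in both columns, one possible sketch is to test each pattern separately rather than comparing only the first match that str_extract returns. The variable names descriptor, authorise, in_descriptor and in_authorise are introduced here for illustration, and literal (fixed) matching is an assumption:

```r
patts <- c("dog", "cat", "all animal")
descriptor <- c("all animal dog", "all animal cat", "rabbit exotic")
authorise <- c("all animal dog cat rabbit", "cat horse pig", "dog cat")

# One column per pattern, one row per string: TRUE where the pattern occurs.
in_descriptor <- sapply(patts, grepl, x = descriptor, fixed = TRUE)
in_authorise <- sapply(patts, grepl, x = authorise, fixed = TRUE)

# A row matches if at least one pattern is present in both columns.
matched <- rowSums(in_descriptor & in_authorise) > 0
```

Note that this yields FALSE rather than NA for rows with no match, and that plain substring matching would also find "cat" inside a longer word such as "category".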