Using grepl() in R to match two consecutive words in a sentence (or: How to use wildcards in grepl())?
Suppose I would like to match two consecutive words in a sentence, but explicitly not match other sentences that contain both of these words without one directly following the other.
mydata <- data.frame(text=c("I like pizza, and a read a novel.", "I like novels."))
So, if I do this...
grepl("lik.*? novel.*?", mydata$text, perl=T, ignore.case=T)
...I get "[1] TRUE TRUE", while what I need is "FALSE TRUE", given that "like" in the first sentence doesn't refer to "novel".
Now, this might be a bad example, given that I could simply search for "like novel.*?" without a wildcard for the first word, but suppose further that I need to use this wildcard for the first of the two words, too.
And connected to that: How would one match a word in a sentence with a wildcard in the middle of said word?
Example:
mydata<-data.frame(text=c("xxx abc xxx", "xxx azc xxx", "xxx a bc xxx"))
I would like to match words that start with "a" and end with "c" no matter what comes in between, on the condition that this must be a single word. Currently, I get "TRUE" even for the third line, while what I need is a match for the first two but not for the third:
grepl("a.*?c", mydata$text, perl=T, ignore.case=T)
If the words are consecutive, match them with only a single space in between:
grepl("like\\b \\bnovel", mydata$text, perl=TRUE, ignore.case=TRUE)
#[1] FALSE TRUE
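The question also asks for a wildcard on the first word. A minimal sketch of one way to do that, assuming "wildcard" here means "any further word characters": replace `.*?` with `\\w*`, which cannot cross a space, and require whitespace (`\\s+`) between the two words. The names `pattern` and `result` are only illustrative.

```r
# Same data as in the question
mydata <- data.frame(
  text = c("I like pizza, and a read a novel.", "I like novels."),
  stringsAsFactors = FALSE
)

# \\blik\\w*  : a word starting with "lik" (like, likes, liked, ...)
# \\s+        : only whitespace may separate the two words
# novel\\w*   : a word starting with "novel" (novel, novels, ...)
pattern <- "\\blik\\w*\\s+novel\\w*"

result <- grepl(pattern, mydata$text, perl = TRUE, ignore.case = TRUE)
result
#[1] FALSE  TRUE
```

The original pattern matched both sentences because `.` also matches spaces and punctuation, so `lik.*? novel` can stretch across "pizza, and a read a".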
For the second case, we can use the word boundary (\\b) at the beginning and end of the word:
grepl("\\ba\\w+c\\b", mydata$text, perl = TRUE, ignore.case = TRUE)
#[1] TRUE TRUE FALSE
Here the pattern to match is a word boundary (\\b), followed by the character 'a', one or more word characters (\\w+), and 'c', followed by another word boundary (\\b).
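As a small illustration of the quantifier choice, the sketch below adds a hypothetical fourth string, "xxx ac xxx", which is not in the original data. With `\\w+` at least one character is required between 'a' and 'c', so "ac" is rejected; `\\w*` would accept it. The variable names are made up for the example.

```r
# Original three strings plus a hypothetical "ac" case
mydata <- data.frame(
  text = c("xxx abc xxx", "xxx azc xxx", "xxx a bc xxx", "xxx ac xxx"),
  stringsAsFactors = FALSE
)

pattern_plus <- "\\ba\\w+c\\b"  # at least one word character between a and c
pattern_star <- "\\ba\\w*c\\b"  # zero or more word characters between a and c

res_plus <- grepl(pattern_plus, mydata$text, perl = TRUE, ignore.case = TRUE)
res_plus
#[1]  TRUE  TRUE FALSE FALSE

res_star <- grepl(pattern_star, mydata$text, perl = TRUE, ignore.case = TRUE)
res_star
#[1]  TRUE  TRUE FALSE  TRUE

# Extract the matched word from each string that matched
matched <- regmatches(
  mydata$text,
  regexpr(pattern_star, mydata$text, perl = TRUE, ignore.case = TRUE)
)
matched
#[1] "abc" "azc" "ac"
```

In both variants "a bc" is rejected, because `\\w` does not match a space.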