Negative lookbehind in R with multi-word separation
I'm using R to do some string processing, and would like to identify the strings that contain a word with a certain root that is not preceded by another word with a different root.
Here is a simple toy example. Say I would like to identify the strings that have the word "cat/s" not preceded by "dog/s" anywhere in the string.
tests = c(
"dog cat",
"dogs and cats",
"dog and cat",
"dog and fluffy cats",
"cats and dogs",
"cat and dog",
"fluffy cats and fluffy dogs")
Using this pattern, I can pull the strings that do have dog before cat:
pattern = "(dog(s|).*)(cat(s|))"
grep(pattern, tests, perl = TRUE, value = TRUE)
[1] "dog cat" "dogs and cats" "dog and cat" "dog and fluffy cats"
My negative lookbehind is having problems:
neg_pattern = "(?<!dog(s|).*)(cat(s|))"
grep(neg_pattern, tests, perl = TRUE, value = TRUE)
Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression
In addition: Warning message:
In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'
I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" separated by any number of other words?
I hope that this can help:我希望这可以帮助:
tests = c(
"dog cat",
"dogs and cats",
"dog and cat",
"dog and fluffy cats",
"cats and dogs",
"cat and dog",
"fluffy cats and fluffy dogs"
)
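# remove strings that have "cat(s)" somewhere after "dog(s)";
# !grepl() is used rather than -grep() because tests[-integer(0)] would
# return an empty vector if no string matched
tests = tests[!grepl(pattern = "dogs?.*cats?", x = tests, perl = TRUE)]
# select only strings that contain "cat(s)"
tests = tests[grepl(pattern = "cats?", x = tests, perl = TRUE)]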
tests
[1] "cats and dogs" "cat and dog"
[3] "fluffy cats and fluffy dogs"
I'm not sure if you wanted to do this with one expression, but regular expressions can still be very useful when applied iteratively.
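If a single expression is preferred, one possible approach is to replace the lookbehind with a negative lookahead anchored at the start of the string, since PCRE allows variable-length patterns inside lookaheads. The sketch below assumes the goal is to reject any string in which "dog(s)" occurs somewhere before "cat(s)"; the name `single_pattern` is only illustrative.

```r
tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# Assumed semantics: drop a string if "dog(s)" appears anywhere before "cat(s)".
# ^(?!.*dogs?.*cats?)  from the start, no "dog ... cat" sequence may follow
# .*cats?              the string must still contain "cat(s)"
single_pattern = "^(?!.*dogs?.*cats?).*cats?"

result = grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(result)
```

On the toy data this should select the same three strings as the two-step version. Note that a string such as "cat dog cat" is rejected by this pattern, because a dog precedes the second cat. If such strings should be kept (at least one cat with no dog before it), a pattern along the lines of `"^(?:(?!dog).)*cat"` with `perl = TRUE` might be closer to what is wanted.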