Negative lookbehind in R with multi-word separation

I'm using R for some string processing and would like to identify the strings that contain a word with a certain root that is not preceded by another word with a different root.

Here is a simple toy example. Say I would like to identify the strings that contain the word "cat/s" not preceded by "dog/s" anywhere in the string.

 tests = c(
   "dog cat",
   "dogs and cats",
   "dog and cat", 
   "dog and fluffy cats",
   "cats and dogs", 
   "cat and dog",  
   "fluffy cats and fluffy dogs")  

Using this pattern, I can pull the strings that do have dog before cat:

 pattern = "(dog(s|).*)(cat(s|))"
 grep(pattern, tests, perl = TRUE, value = TRUE)

[1] "dog cat"  "dogs and cats"   "dog and cat"   "dog and fluffy cats"

My negative lookbehind, however, fails:

 neg_pattern = "(?<!dog(s|).*)(cat(s|))"
 grep(neg_pattern, tests, perl = TRUE, value = TRUE)

Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression

In addition: Warning message: In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'

I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat", separated by any number of other words?

I hope this helps:

tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat", 
  "dog and fluffy cats",
  "cats and dogs", 
  "cat and dog",  
  "fluffy cats and fluffy dogs"
)

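# remove strings that have cats after dogs
# (!grepl() is used rather than -grep(): a negative index built from an
# empty grep() result would drop every element)
tests = tests[!grepl(pattern = "dogs?.*cats?", x = tests)]

# select only strings that contain cats
tests = tests[grepl(pattern = "cats?", x = tests)]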

tests

[1] "cats and dogs"               "cat and dog"                
[3] "fluffy cats and fluffy dogs"
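
Filtering by exclusion has a corner case worth knowing: when no element matches, grep() returns integer(0), and a negative index built from it selects nothing instead of everything. The sketch below illustrates this with a made-up vector (`no_dogs` is hypothetical, not part of the question) and compares it with a logical mask from grepl().

```r
# Hypothetical input in which no string has "dog" before "cat"
no_dogs <- c("cats only", "fluffy cat")

# grep() finds no matches, so idx is integer(0)
idx <- grep("dogs?.*cats?", no_dogs)

# Negating an empty index still gives an empty index: every element is dropped
by_index <- no_dogs[-idx]

# A logical mask has one entry per element, so nothing is lost
by_mask <- no_dogs[!grepl("dogs?.*cats?", no_dogs)]

print(length(by_index))
print(by_mask)
```

With the toy data from the question at least one string always matches, so both forms give the same result there.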

I'm not sure whether you wanted to do this with a single expression, but regex can still be very useful when applied iteratively.
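
If a single expression is preferred, one possible approach is to replace the lookbehind with a negative lookahead anchored at the start of the string, since PCRE lookaheads may be of variable length. This is only a sketch under the assumption that the goal is to reject any string in which some "dog/s" appears before some "cat/s"; the name `single_pattern` is introduced here for illustration.

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^(?!.*dogs?.*cats?)  from the start, fail if "dog" is later followed by "cat"
# .*cats?              otherwise require that the string contains "cat"
single_pattern <- "^(?!.*dogs?.*cats?).*cats?"

result <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(result)
```

On the toy data this should return the same three strings as the two-step filter. Note that a string such as "cat dog cat" would be rejected as well, because a "dog" precedes the second "cat".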
