Negative lookbehind in R with multi-word separation

Question

I'm using R to do some string processing, and would like to identify the strings that contain a word with a certain root, provided that word is not preceded by a word with another given root.

Here is a simple toy example. Say I would like to identify the strings that contain the word "cat/s" not preceded by "dog/s" anywhere in the string.

 tests = c(
   "dog cat",
   "dogs and cats",
   "dog and cat", 
   "dog and fluffy cats",
   "cats and dogs", 
   "cat and dog",  
   "fluffy cats and fluffy dogs")

Using this pattern, I can pull the strings that do have dog before cat:

 pattern = "(dog(s|).*)(cat(s|))"
 grep(pattern, tests, perl = TRUE, value = TRUE)

[1] "dog cat"  "dogs and cats"   "dog and cat"   "dog and fluffy cats"

My negative lookbehind attempt fails:

 neg_pattern = "(?<!dog(s|).*)(cat(s|))"
 grep(neg_pattern, tests, perl = TRUE, value = TRUE)

Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression

In addition: Warning message: In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'

I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" separated by any number of other words?
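
To illustrate the restriction named in the error message, here is a minimal sketch (the shortened `tests` vector and the variable name `fixed` are chosen for illustration only). A lookbehind with a fixed-length body compiles with perl = TRUE, but it can only rule out "dog " immediately before "cat", not a "dog" separated from "cat" by other words:

```r
tests <- c("dog cat", "dogs and cats", "dog and cat", "cats and dogs")

# "(?<!dog )" has a fixed length of four characters, so PCRE accepts it.
# It only rejects a "cat" that is directly preceded by "dog ".
fixed <- grep("(?<!dog )cat", tests, perl = TRUE, value = TRUE)
print(fixed)
```

Only "dog cat" is filtered out here; "dogs and cats" and "dog and cat" still match, which is why a fixed-length lookbehind does not solve the problem as stated.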

Answer 1

I hope this helps:

tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat", 
  "dog and fluffy cats",
  "cats and dogs", 
  "cat and dog",  
  "fluffy cats and fluffy dogs"
)

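# drop strings in which "dog(s)" appears before "cat(s)"
# (logical indexing with grepl avoids tests[-integer(0)], which would
# return an empty vector if no string matched)
tests = tests[!grepl(pattern = "dogs?.*cats?", x = tests)]

# keep only the strings that contain "cat(s)"
tests = tests[grepl(pattern = "cats?", x = tests)]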

tests

[1] "cats and dogs"               "cat and dog"                
[3] "fluffy cats and fluffy dogs"

I'm not sure whether you wanted to do this with a single expression, but regex can still be very useful when applied iteratively.
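
If a single expression is preferred, one possible sketch (not part of the original answer) swaps the lookbehind for a negative lookahead anchored at the start of the string, since lookaheads are not subject to the fixed-length restriction. The names `single_pattern`, `kept`, and `kept_two_step` are illustrative. It assumes the same semantics as the two-step filter above: a string is rejected if any "dog(s)" occurs before any "cat(s)".

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^(?!.*dogs?.*cats?)  fail if some "dog(s)" is later followed by "cat(s)"
# .*cats?              otherwise require "cat(s)" somewhere in the string
single_pattern <- "^(?!.*dogs?.*cats?).*cats?"
kept <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(kept)

# The same filter written as two logical tests, for comparison
kept_two_step <- tests[grepl("cats?", tests) & !grepl("dogs?.*cats?", tests)]
```

Both versions match on substrings without word boundaries, as the original patterns do, so a word such as "dogma" would also count as "dog".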

Answer 1: 0 2017-09-25 09:09:17
