R-regex: match strings not beginning with a pattern

I'd like to use a regex to check whether a string does not begin with a certain pattern. While I can use [^ to blacklist certain characters, I can't figure out how to blacklist a pattern.

> grepl("^[^abc].+$", "foo")
[1] TRUE
> grepl("^[^abc].+$", "afoo")
[1] FALSE

I'd like to do something like grepl("^[^(abc)].+$", "afoo") and get TRUE, i.e. to match if the string does not start with the sequence abc.

Note that I'm aware of this post, and I also tried using perl = TRUE, but with no success:

> grepl("^((?!hede).)*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^((?!hede).)*$", "foohede", perl = TRUE)
[1] FALSE

Any ideas?

Yeah. Put the zero-width lookahead outside the other parentheses. That should give you this:

> grepl("^(?!hede).*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^(?!hede).*$", "foohede", perl = TRUE)
[1] TRUE

which I think is what you want.

Alternatively, if you want to capture the entire string, ^(?!hede)(.*)$ and ^((?!hede).*)$ are equivalent and both acceptable.
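
Applied to the abc case from the question, the same idea can be sketched as below. The vector x and its values are made up for illustration and are not part of the original post. The last two calls are base R alternatives that avoid lookahead altogether by negating an ordinary prefix test, which may be easier to read when only a TRUE/FALSE answer is needed.

```r
# Hypothetical test vector (illustrative only, not from the question)
x <- c("foo", "afoo", "abcfoo", "abc")

# Negative lookahead anchored at the start:
# TRUE when the string does not begin with "abc"
no_abc_lookahead <- grepl("^(?!abc).*$", x, perl = TRUE)

# Same answer without lookahead: negate an ordinary prefix match
no_abc_negated <- !grepl("^abc", x)

# Same answer for a literal (non-regex) prefix
no_abc_literal <- !startsWith(x, "abc")

no_abc_lookahead
```

Note that "afoo" counts as not starting with abc, which is what the character class [^(abc)] cannot express: a class always tests a single character, so the parentheses inside it are just two more excluded characters.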

I got stuck on the following special case, so I thought I would share...

What if the pattern occurs several times in the string, but you only want the first segment?

Apparently you can turn off the default greediness of a quantifier with a modifier in a Perl-style regular expression.

Suppose the string I wanted to process was

myExampleString = paste0(c(letters[1:13], "_", letters[14:26], "__",
                           LETTERS[1:13], "_", LETTERS[14:26], "__",
                           "laksjdl", "_", "lakdjlfalsjdf"),
                         collapse = "")
myExampleString

"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjdf"

and that I wanted only the first segment, before the first "__". I cannot simply search on "_", because a single underscore is an allowable non-delimiter in this example string.

The following doesn't work. Because of the default greediness, it gives me the first and second segments instead (but not the third, because of the lookahead).

gsub("^(.+(?=__)).*$", "\\1", myExampleString, perl = TRUE)

"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ"

But this does work:

gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)

"abcdefghijklm_nopqrstuvwxyz"

The difference is the lazy (non-greedy) modifier "?" after the ".+" in the (Perl) regular expression. It makes ".+" match as few characters as possible, so the capture stops at the first "__".
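
As a cross-check, the first segment can also be pulled out without any lookahead. This is a sketch of alternatives, not part of the original answer; it rebuilds the same myExampleString so it runs on its own, and the names first_lazy, first_sub and first_split are illustrative.

```r
myExampleString <- paste0(c(letters[1:13], "_", letters[14:26], "__",
                            LETTERS[1:13], "_", LETTERS[14:26], "__",
                            "laksjdl", "_", "lakdjlfalsjdf"),
                          collapse = "")

# Lazy quantifier plus lookahead, as in the answer
first_lazy <- gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)

# Delete everything from the first "__" onward;
# the regex engine finds the leftmost "__" and ".*" consumes the rest
first_sub <- sub("__.*$", "", myExampleString)

# Split on the literal delimiter and keep the first piece
first_split <- strsplit(myExampleString, "__", fixed = TRUE)[[1]][1]

first_lazy
```

All three approaches should agree on this input. The sub() and strsplit() variants do not need perl = TRUE, since they use neither lookahead nor lazy quantifiers.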

There is now (years later) another possibility with the stringr package.

library(stringr)

str_detect("dsadsf", "^abc", negate = TRUE)
#> [1] TRUE

str_detect("abcff", "^abc", negate = TRUE)
#> [1] FALSE

Created on 2020-01-13 by the reprex package (v0.3.0)
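
The same negate argument is also available on str_subset(), which helps when the goal is to keep the non-matching strings rather than to get a logical vector. The sketch below assumes a stringr version recent enough to support negate on both functions; the vector x is made up for illustration.

```r
library(stringr)

# Hypothetical input (illustrative only)
x <- c("dsadsf", "abcff", "xabc")

# Logical vector: TRUE where the string does not start with "abc"
keep <- str_detect(x, "^abc", negate = TRUE)

# Directly return the strings that do not start with "abc"
kept <- str_subset(x, "^abc", negate = TRUE)

kept
```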
