
Overlapping matches in Regex

I can't seem to find an answer to this problem, and I'm wondering if one exists. Simplified example:

Consider a string "nnnn", where I want to find all matches of "nn", including those that overlap with each other. So the regex would provide the following 3 matches:

  1. nn at positions 0-1: [nn]nn
  2. nn at positions 1-2: n[nn]n
  3. nn at positions 2-3: nn[nn]

I realize this is not exactly what regexes are meant for, but walking the string and parsing it manually seems like an awful lot of code, considering that in reality the matching would have to be done with a pattern, not a literal string.

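Update 2016:

To get all three overlapping nn matches, SDJMcHattie proposes (?=(nn)) in the comments (see regex101). The lookahead itself consumes no characters, so the engine retries at every position, and the capturing group inside it holds the matched text.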

(?=(nn))
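As a rough illustration of that pattern, here is a minimal sketch in Python (my choice of language, not something from the comments). The helper name overlapping_matches is hypothetical, and it assumes the engine steps forward one character after each zero-width match, which Python's re.finditer does.

```python
import re

def overlapping_matches(pattern, text):
    """Return (start, matched_text) for every match, including overlapping ones.

    The pattern is wrapped in a lookahead with one capturing group, so each
    match is zero-width and the text of interest is in group 1.
    """
    wrapped = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in wrapped.finditer(text)]

print(overlapping_matches("nn", "nnnn"))
# [(0, 'nn'), (1, 'nn'), (2, 'nn')]
```

Because the pattern is passed in as a string, the same helper should work for a real pattern rather than the literal "nn".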

Original answer (2008)

A possible solution could be to use a positive lookbehind:

(?<=n)n

It would give you the end position of each overlapping match (the matched character is in brackets):

  1. n[n]nn
  2. nn[n]n
  3. nnn[n]
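To check those positions, a small Python sketch (Python is an assumption here; the answer names no language, and lookbehind_spans is a made-up helper name):

```python
import re

def lookbehind_spans(text):
    """Return (start, end) of every 'n' that is preceded by another 'n'."""
    return [(m.start(), m.end()) for m in re.finditer(r"(?<=n)n", text)]

print(lookbehind_spans("nnnn"))
# [(1, 2), (2, 3), (3, 4)]
```

Each span covers only the second character of a pair, so you get where each overlapping "nn" ends, not the full matched text.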

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example).

Rather than his proposal (?=nn)n, I would prefer the simpler form:

(n)(?=(n))

That would match at the first position of each string you want and would capture the second n in group(2).

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
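A short sketch of how the two groups come out, written in Python as an assumption (the answer is not tied to a language). The helper adjacent_pairs is hypothetical, and it assumes the two sub-patterns contain no capturing groups of their own, since those would shift the group numbers.

```python
import re

def adjacent_pairs(first, second, text):
    """Match `first` only where `second` follows, without consuming `second`.

    group(1) is the consumed part; group(2) is captured inside the lookahead.
    """
    regex = re.compile(f"({first})(?=({second}))")
    return [(m.start(), m.group(1), m.group(2)) for m in regex.finditer(text)]

print(adjacent_pairs("n", "n", "nnnn"))
# [(0, 'n', 'n'), (1, 'n', 'n'), (2, 'n', 'n')]

print(adjacent_pairs(r"\d", r"\d", "1234"))
# [(0, '1', '2'), (1, '2', '3'), (2, '3', '4')]
```

The second call shows that the groups hold whatever the sub-patterns matched, not just a literal n.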


Using a lookahead with a capturing group works, at the expense of making your regex slower and more complicated. An alternative solution is to tell the Regex.Match() method where the next match attempt should begin. Try this:

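Regex regexObj = new Regex("nn");
Match matchObj = regexObj.Match(subjectString);
while (matchObj.Success) {
    // Use matchObj.Index and matchObj.Value here, then resume the search
    // one character after the start of this match so overlaps are found.
    matchObj = regexObj.Match(subjectString, matchObj.Index + 1);
}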
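For comparison, the same idea sketched in Python instead of .NET (a swapped-in language, so treat it as an illustration only). The helper find_overlapping is a hypothetical name; the sketch relies on the pos argument of a compiled pattern's search() method, which plays the role of the start index passed to Regex.Match() above.

```python
import re

def find_overlapping(pattern, text):
    """Collect (start, matched_text) by restarting the search one character
    after the start of each match, so overlapping matches are not skipped."""
    regex = re.compile(pattern)
    results = []
    match = regex.search(text)
    while match:
        results.append((match.start(), match.group()))
        # Resume just past the start (not the end) of the current match.
        match = regex.search(text, match.start() + 1)
    return results

print(find_overlapping("nn", "nnnn"))
# [(0, 'nn'), (1, 'nn'), (2, 'nn')]

print(find_overlapping("aba", "ababa"))
# [(0, 'aba'), (2, 'aba')]
```

The pattern stays unchanged here; the overlap handling lives entirely in the loop.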

AFAIK, there is no pure regex way to do that at once (i.e. returning the three captures you request without a loop).

However, you can find the pattern once, then loop, restarting the search at an offset of (found position + 1). That combines regex use with simple code.

[EDIT] Great, I am downvoted when I basically said what Jan showed...
[EDIT 2] To be clear: Jan's answer is better. Not more precise, but certainly more detailed, and it deserves to be chosen. I just don't understand why mine is downvoted, since I still see nothing incorrect in it. Not a big deal, just annoying.
