Regex optional group is skipped

I'm trying to capture an optional group inside a required group. This is my regex so far:

BEGIN
(?<body>
     (?<A>.*?)               # in my regex just .*?
     (?<B>START.*?STOP)?
     (?<C>.*?)               # in my regex just .*?
)
END

For this input:

junk1 BEGIN junk2 START content STOP junk3 END junk4

I'm getting this:

match: 'BEGIN junk2 START content STOP junk3 END' # ok
group 'body':  ' junk2 START content STOP junk3 ' # ok
group 'A': ''                                     # expected: ' junk2 '
group 'B': not found                              # expected: 'START content STOP'
group 'C': ' junk2 START content STOP junk3 '     # expected: ' junk3 '

Groups A and C are included only for reference.

Why is group B not matched even though the input contains matching data, and why do I get the expected result when group B is not optional?

Why it doesn't work

You need a basic understanding of how the regex engine and backtracking work. In this case it goes something like this:

Group (?<A>.*?) matches the empty string on the first attempt, since that is what a lazy quantifier means: the empty string, or more only if required.

Then we reach group B immediately after BEGIN. Since there is no START at that position, the inner group fails and the optional group is skipped.

Group C matches the empty string. Then we try to match END, which fails. The regex engine therefore backtracks into the last quantifier, in this case the one in group C, and lets it take one more character. It repeats this until a match is found (or, if that fails, moves on to the quantifiers before it).

So group C ends up expanding until END, and then the whole expression matches.
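
The walk-through above can be reproduced with a small sketch. It uses Python's `re` module rather than the engine from the question, on the assumption that both backtracking engines behave the same way for this pattern; Python writes named groups as `(?P<name>...)`, and the names `ORIGINAL`, `text`, and `groups` are illustrative only.

```python
import re

# The question's pattern, ported to Python syntax. re.VERBOSE lets the
# pattern be laid out over several lines with comments.
ORIGINAL = re.compile(
    r"""
    BEGIN
    (?P<body>
        (?P<A>.*?)             # lazy: starts out empty
        (?P<B>START.*?STOP)?   # tried right after BEGIN, fails, is skipped
        (?P<C>.*?)             # last quantifier: backtracked into first
    )
    END
    """,
    re.VERBOSE,
)

text = "junk1 BEGIN junk2 START content STOP junk3 END junk4"
groups = ORIGINAL.search(text).groupdict()

print(repr(groups["A"]))  # ''
print(groups["B"])        # None
print(repr(groups["C"]))  # ' junk2 START content STOP junk3 '
```

Group A never has to grow, because expanding group C alone is enough to reach END, so group B is never retried at a later position.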

Example solution

If START and STOP are not allowed anywhere in the body except in the optional group, a simple solution is an expression like:

BEGIN
(?<body>
     (?<A>
          (?: (?!START|STOP) . )*?   # do not match START nor STOP
     )
     (?<B>START.*?STOP)?
     (?<C>
          (?: (?!START|STOP) . )*?   # do not match START nor STOP
     )
)
END
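
As a quick check of this pattern, here is a minimal sketch ported to Python's `re` module (an assumption; the question appears to target a different engine, and Python spells named groups `(?P<name>...)`). The names `TEMPERED`, `with_block`, and `without_block` are made up for the example.

```python
import re

# A and C may only consume characters that do not begin START or STOP,
# so group C can no longer swallow the START...STOP block. The engine is
# forced to backtrack into group A until group B can match.
TEMPERED = re.compile(
    r"""
    BEGIN
    (?P<body>
        (?P<A> (?: (?!START|STOP) . )*? )
        (?P<B> START.*?STOP )?
        (?P<C> (?: (?!START|STOP) . )*? )
    )
    END
    """,
    re.VERBOSE,
)

with_block = TEMPERED.search(
    "junk1 BEGIN junk2 START content STOP junk3 END junk4"
).groupdict()
without_block = TEMPERED.search("BEGIN plain END").groupdict()

print(with_block["A"], with_block["B"], with_block["C"], sep="|")
# ' junk2 |START content STOP| junk3 '
print(without_block["B"])        # None
print(repr(without_block["C"]))  # ' plain '
```

When the block is absent, group A stays empty and group C takes the whole body, which is the same outcome the original pattern produced.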

In C#, pattern matching is driven by the regular expression pattern, not by the input text. Here, group A is not required to match anything, so the decision is deferred; group B is not required to match anything, so the decision is deferred again; group C is not required to match anything either, but it is the last construct before END. Group C is therefore matched against the input string and is assigned the text you expected to find in group B, along with the rest of the body. If you use right-to-left pattern matching, all of that content goes to group A instead.

Constructs earlier in the pattern are prioritized over those later in the pattern, and the lazy quantifier ( *? ) prioritizes shorter matches. So the engine's first choice is for group A to match nothing, and since START can't match at that first position, group B is skipped. Finally, group C eats the rest of the string up to END, because END is not optional.

Use this instead:

BEGIN
(?<body>
    (?<A>.*?)
    (?:
        (?<B>START.*?STOP)
        (?<C>.*?)
    )?
)
END

Because group C now exists only together with group B, it can no longer absorb the text in front of START. This forces group A to expand up to the first match of B if there is one, and up to END otherwise.
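
A minimal sketch of this nested-optional pattern, again ported to Python's `re` module as an assumption (named groups become `(?P<name>...)`; `NESTED`, `with_block`, and `without_block` are illustrative names):

```python
import re

# B and C are wrapped in one optional non-capturing group, so C can only
# take part in the match after B has matched.
NESTED = re.compile(
    r"""
    BEGIN
    (?P<body>
        (?P<A>.*?)
        (?:
            (?P<B>START.*?STOP)
            (?P<C>.*?)
        )?
    )
    END
    """,
    re.VERBOSE,
)

with_block = NESTED.search(
    "junk1 BEGIN junk2 START content STOP junk3 END junk4"
).groupdict()
without_block = NESTED.search("BEGIN plain END").groupdict()

print(with_block["A"], with_block["B"], with_block["C"], sep="|")
# ' junk2 |START content STOP| junk3 '
print(repr(without_block["A"]))  # ' plain '
print(without_block["C"])        # None
```

One difference from the lookahead-based approach: when there is no START...STOP block, the whole body lands in group A and group C does not participate in the match at all.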
