regex optional group is skipped
I'm trying to capture an optional group inside a required group. This is my regex so far:
BEGIN
(?<body>
(?<A>.*?) # in my regex just .*?
(?<B>START.*?STOP)?
(?<C>.*?) # in my regex just .*?
)
END
And for this input:
junk1 BEGIN junk2 START content STOP junk3 END junk4
I'm getting this:
match: 'BEGIN junk2 START content STOP junk3 END' # ok
group 'body': ' junk2 START content STOP junk3 ' # ok
group 'A': '' # expected: ' junk2 '
group 'B': not found # expected: 'START content STOP'
group 'C': ' junk2 START content STOP junk3 ' # expected: ' junk3 '
Groups A and C are included only for reference.
Why is group B not matched even though the input contains matching data, while I get the expected result if group B is not optional?
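The behaviour described in the question can be reproduced with a short script. This is a minimal sketch in Python, which is an assumption (the question does not name its engine); Python's re module spells named groups (?P<name>...) rather than (?<name>...), and re.VERBOSE plays the role of the ignore-whitespace option. The names OPTIONAL_B and REQUIRED_B are made up for the example.

```python
import re

# The pattern from the question, with group B optional.
# re.VERBOSE makes the engine ignore whitespace inside the pattern.
OPTIONAL_B = re.compile(r"""
    BEGIN
    (?P<body>
        (?P<A>.*?)
        (?P<B>START.*?STOP)?
        (?P<C>.*?)
    )
    END
""", re.VERBOSE)

# The same pattern with group B required (no trailing "?").
REQUIRED_B = re.compile(r"""
    BEGIN
    (?P<body>
        (?P<A>.*?)
        (?P<B>START.*?STOP)
        (?P<C>.*?)
    )
    END
""", re.VERBOSE)

text = "junk1 BEGIN junk2 START content STOP junk3 END junk4"

optional = OPTIONAL_B.search(text).groupdict()
required = REQUIRED_B.search(text).groupdict()

print(optional)  # B is None and C holds the whole body
print(required)  # A, B and C are split as the question expects
```

With B optional, everything lands in C; with B required, the engine is forced to grow A until START can match.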
You need a basic understanding of how the regex engine and backtracking work. In this case it goes something like this:
Group (?<A>.*?) matches the empty string on the first attempt (that is what the expression means: the empty string, or more if required).
Then we get to group B right after BEGIN. As there is no START at that position, the inner group fails and the optional group is skipped.
Group C matches the empty string. Then we try to match END, which fails. The regex engine therefore backtracks into the last quantifier, in this case the one in group C. It keeps doing this until a match is found (or, if that fails, it tries the quantifiers before it).
So we end up with group C expanding up to END, and then the whole expression matches.
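The individual steps of that walkthrough can be observed in isolation. The sketch below uses Python's re module as an assumption (named groups are written (?P<name>...) there); the variable names are invented for the illustration.

```python
import re

# Step 1: a lazy quantifier with nothing after it settles for the empty string.
alone = re.match(r"(?P<A>.*?)", " junk2 START content STOP")

# Step 2: an optional group that cannot match at the current position
# is skipped; the overall match still succeeds, but the group is unset.
skipped = re.match(r"(?P<B>START.*?STOP)?", " junk2 START content STOP")

# Step 3: when a required END follows, the engine backtracks into the lazy
# quantifier and lets it take one more character at a time until END matches.
expanded = re.match(r"(?P<C>.*?)END", " junk3 END")

print(repr(alone.group("A")))     # ''
print(skipped.group("B"))         # None
print(repr(expanded.group("C")))  # ' junk3 '
```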
If START / STOP are not allowed anywhere except inside the optional group, a simple solution is to use an expression like:
BEGIN
(?<body>
(?<A>
(?: (?!START|STOP) . )*? # do not match START nor STOP
)
(?<B>START.*?STOP)?
(?<C>
(?: (?!START|STOP) . )*? # do not match START nor STOP
)
)
END
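A quick check of the pattern above, sketched in Python as an assumption about the engine (named groups become (?P<name>...), and re.VERBOSE allows the whitespace and # comments). The name TEMPERED and the second input string are made up for the example.

```python
import re

TEMPERED = re.compile(r"""
    BEGIN
    (?P<body>
        (?P<A>
            (?: (?!START|STOP) . )*?   # do not match START nor STOP
        )
        (?P<B>START.*?STOP)?
        (?P<C>
            (?: (?!START|STOP) . )*?   # do not match START nor STOP
        )
    )
    END
""", re.VERBOSE)

with_b = TEMPERED.search(
    "junk1 BEGIN junk2 START content STOP junk3 END junk4"
).groupdict()

# Hypothetical input without a START ... STOP section.
without_b = TEMPERED.search("BEGIN just junk END").groupdict()

print(with_b)
print(without_b)
```

Because neither A nor C can step over START, the only way to reach END is for B to consume the START ... STOP section. When there is no such section, A stays empty and the text ends up in C.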
In C#, pattern matching is driven by the regular expression pattern, not by the input text. So what happens is: group A is not required to match anything, so the decision is delayed; group B is not required to match anything, so the decision is delayed; group C is not required to match anything either, but the end of the regular expression has been reached. Group C is then matched against the input string and is assigned everything, including what you expected to be in group B. If you use right-to-left pattern matching, all of the content goes to group A instead.
Constructs earlier in the pattern are prioritized over those later in the pattern, and the lazy quantifier (*?) prioritizes shorter matches. So the preferred match is always one where the A group matches nothing, and since START can't match at that first position, group B is skipped. Finally, the C group eats the rest of the string up to END, since END is not optional.
Use this instead:
BEGIN
(?<body>
(?<A>.*?)
(?:
(?<B>START.*?STOP)
(?<C>.*?)
)?
)
END
It forces the A group to consume everything up to the first match of B, if one exists.
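The nested-optional pattern can be checked the same way. This is a sketch in Python, which is an assumption about the engine (named groups are written (?P<name>...) there); NESTED and the second input string are invented for the example.

```python
import re

NESTED = re.compile(r"""
    BEGIN
    (?P<body>
        (?P<A>.*?)
        (?:
            (?P<B>START.*?STOP)
            (?P<C>.*?)
        )?
    )
    END
""", re.VERBOSE)

with_b = NESTED.search(
    "junk1 BEGIN junk2 START content STOP junk3 END junk4"
).groupdict()

# Hypothetical input without a START ... STOP section.
without_b = NESTED.search("BEGIN just junk END").groupdict()

print(with_b)
print(without_b)
```

Since C now lives inside the optional group, it can no longer absorb the body on its own: when B is skipped, only A can grow towards END, and each time A grows by one character the engine retries B at the new position. When there is no START ... STOP section, A takes the whole body and both B and C are unset.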