Regex possessive quantifier versus lazy or greedy

Can anyone explain to me, step by step, why this regex fails:

<.++>

when matched against this string: <em>

The same string is matched with lazy or greedy quantifiers, but what steps are involved in this case?

I use the Java regex flavor.

From the Java Pattern documentation:

Possessive quantifiers, which greedily match as much as they can and do not back off, even when doing so would allow the overall match to succeed.

In your example, the < in your regex matches the < in the string, then .++ matches the entire rest of the string, em>. You still have a > in your regex, but there are no characters left in the string for it to match (because .++ consumed them all). So the match fails.

If the quantifier were greedy, i.e. if it were .+ instead of .++, at this point the regular expression engine would try reducing the portion matched by .+ by one character, to just em, and try again. This time the match would succeed, because there would be a > left in the string for the > in the regex to match.

EDIT: A lazy quantifier would work like a greedy quantifier in reverse. Instead of starting by trying to match the whole rest of the string and backing off character by character, the lazy quantifier would start by trying to match a single character, in this case just e. If that doesn't allow the full regex to match (which it wouldn't here, because you'd have > in the regex trying to match m in the string), the lazy quantifier would move up to matching two characters, em. Then the > in the regex would line up with > in the string and the match would succeed. If it didn't work out, though, the lazy quantifier would move up to three characters, and so on.
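The three behaviours described above can be checked directly in Java. This is only a minimal sketch: the class name QuantifierDemo and the helper fullMatch are made up for illustration, and Matcher.matches() is used so that the regex has to cover the whole input, as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns the matched text when the regex
    // matches the entire input, or null when it does not.
    static String fullMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<em>";
        // Greedy: .+ first takes "em>", then gives back ">" so the match succeeds.
        System.out.println("greedy     <.+>  : " + fullMatch("<.+>", input));
        // Lazy: .+? first takes "e", then grows to "em" so the match succeeds.
        System.out.println("lazy       <.+?> : " + fullMatch("<.+?>", input));
        // Possessive: .++ takes "em>" and never gives anything back, so it fails.
        System.out.println("possessive <.++> : " + fullMatch("<.++>", input));
    }
}
```

The first two lines print <em> and the last prints null.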

A possessive quantifier prevents backtracking, so the .++ part matches the remaining string em>, eating up the last > as well.

Hence the last > of the regex has nothing left to match and the regex fails.

Like a greedy quantifier, a possessive quantifier will repeat the token as many times as possible. Unlike a greedy quantifier, it will not give up matches as the engine backtracks. With a possessive quantifier, the deal is all or nothing. You can make a quantifier possessive by placing an extra + after it.

On greedy variant

First let's consider how a pattern like <.+> matches against <em>:

  • The < in the pattern matches the < in the input.
  • Then .+ matches em> in the input (because it's greedy, it'll first match as many . as possible)
    • Then > doesn't match, since there are no more characters in the input
  • At this point .+ backtracks and must match one less . ; so .+ now matches em
  • Now > in the pattern matches the > in the input.

On reluctant variant

By contrast, this is how <.+?> matches against <em>:

  • The < in the pattern matches the < in the input.
  • Then .+? matches e in the input (because it's reluctant, but must take at least one . )
    • Then > doesn't match, since the rest of the input is m>
  • At this point .+? backtracks and must match one more . ; so .+? now matches em
  • Now > in the pattern matches the > in the input.

On negated character class and possessive quantifiers combo

Note that in either of the above cases, .+ or .+? must backtrack for the > to match. This is why <.++> can NEVER match <em>, because here's what happens:

  • The < in the pattern matches the < in the input
  • Then .++ matches as many . as possible in the input, and will be in possession of this match
    • It will not let go of whatever it matched! (hence "possessive")
    • In this case, .++ is able to match em>
  • Now > in the pattern can never match, because any > will be gobbled up by .++
    • Since it's possessive, .++ will not "cooperate" by giving back the >

A pattern that at least has a chance to match is <[^>]++>. When matched against <em>:

  • The < in the pattern matches the < in the input
  • Then [^>]++ possessively matches as many [^>] as possible in the input (i.e. anything but > )
    • In this case it will possessively match em
  • Now > in the pattern can match the > in the input
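A short Java check of this pattern, as a sketch: the class TagFinder, the method findTags and the second sample input <em>text</em> are assumptions added for illustration. Because [^>]++ can never consume a >, there is nothing for it to give back, so possessiveness costs nothing here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFinder {
    // "<", then one or more non-">" characters taken possessively, then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]++>");

    // Hypothetical helper: collects every substring of the input that matches TAG.
    static List<String> findTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(findTags("<em>"));           // [<em>]
        System.out.println(findTags("<em>text</em>"));  // [<em>, </em>]
    }
}
```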

As much as is practical, you should refrain from using .*? / .* in your pattern. The . is too flexible since it matches (almost!) any character, and this can cause unnecessary backtracking and/or overmatching.

Whenever applicable, you should use a negated character class instead of . in your pattern.
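Both points can be seen in a small Java sketch. The class OvermatchDemo, the helper firstMatch and the sample inputs are assumptions added for illustration. By default, . in Java does not match line terminators (that is the "almost"), while a negated character class such as [^>] does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OvermatchDemo {
    // Hypothetical helper: returns the first substring matching the regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<em>text</em>";
        // Greedy dot overmatches: it runs to the last ">" in the input.
        System.out.println(firstMatch("<.+>", html));     // <em>text</em>
        // The negated class stops at the first ">".
        System.out.println(firstMatch("<[^>]+>", html));  // <em>

        String multiline = "<a\nb>";
        // The dot cannot cross the line terminator, so there is no match at all.
        System.out.println(firstMatch("<.+>", multiline));            // null
        // [^>] matches the newline like any other non-">" character.
        System.out.println(firstMatch("<[^>]+>", multiline) != null); // true
    }
}
```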

regular-expressions.info
