
Non-greedy regex match behaves differently depending on the anchor

I found that a non-greedy regex match only becomes non-greedy when anchored at the front, not at the end:

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*c)/'
abcabcabc
# OK, greedy match

$ echo abcabcabc | perl -ne 'print $1 if /^(a.*?c)/'
abc
# YES! non-greedy match

Now look at this, when anchoring to the end:

$ echo abcabcabc | perl -ne 'print $1 if /(a.*c)$/'
abcabcabc
# OK, greedy match

$ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
abcabcabc
# what, non-greedy become greedy?

Why is that? Why doesn't it print abc as before?

(The problem was found in my Go code, but illustrated in Perl for simplicity).

$ echo abcabcabc | perl -ne 'print $1 if /(a.*?c)$/'
abcabcabc
# what, non-greedy become greedy?

Non-greedy means it'll match the fewest characters possible at the current location such that the entire pattern matches.

After matching a at position 0, bcabcab is the least that .*? can match at position 1 while still satisfying the rest of the pattern.
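The point above can be checked with a small sketch. It uses Python's re module rather than Perl, on the assumption (which holds for these simple patterns) that both backtracking engines behave the same way; the variable names are only illustrative.

```python
import re

text = "abcabcabc"

# Anchored at the front: the lazy quantifier stops at the first "c".
front = re.search(r"^(a.*?c)", text)

# Anchored at the end: the engine still starts at the leftmost "a" (index 0),
# so .*? has to keep growing until "c$" can finally match.
end = re.search(r"(a.*?c)$", text)

print(front.group(1), front.span(1))  # abc (0, 3)
print(end.group(1), end.span(1))      # abcabcabc (0, 9)
```

The spans show that both matches begin at index 0. Laziness only affects how far the match extends, not where it starts.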

"abcabcabc" =~ /a.*?c$/ in detail:

  1. At pos 0, a matches 1 char ( a ).
    1. At pos 1, .*? matches 0 chars (empty string).
      1. At pos 1, c fails to match. Backtrack!
    2. At pos 1, .*? matches 1 char ( b ).
      1. At pos 2, c matches 1 char ( c ).
        1. At pos 3, $ fails to match. Backtrack!
    3. At pos 1, .*? matches 2 chars ( bc ).
      1. At pos 3, c fails to match. Backtrack!
    4. ...
    5. At pos 1, .*? matches 7 chars ( bcabcab ).
      1. At pos 8, c matches 1 char ( c ).
        1. At pos 9, $ matches 0 chars (empty string). Match successful!

"abcabcabc" =~ /a.*c$/ in detail (for contrast):

  1. At pos 0, a matches 1 char ( a ).
    1. At pos 1, .* matches 8 chars ( bcabcabc ).
      1. At pos 9, c fails to match. Backtrack!
    2. At pos 1, .* matches 7 chars ( bcabcab ).
      1. At pos 8, c matches 1 char ( c ).
        1. At pos 9, $ matches 0 chars (empty string). Match successful!
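Both traces can be reproduced with a hand-rolled simulation. This is only a rough sketch in Python, not how a real regex engine is implemented: the helper name trace is hypothetical, it only tries start position 0, and it ignores newline handling.

```python
def trace(text, lazy):
    """Simulate a.*?c$ (lazy=True) or a.*c$ (lazy=False) starting at pos 0."""
    steps = []
    if not text.startswith("a"):
        return None, steps
    remaining = len(text) - 1  # chars available to .* after the leading "a"
    # Lazy tries 0, 1, 2, ... chars; greedy tries the longest option first.
    lengths = range(remaining + 1) if lazy else range(remaining, -1, -1)
    for n in lengths:
        pos = 1 + n  # position right after what .* consumed
        if text[pos:pos + 1] != "c":
            steps.append((n, "c fails"))
        elif pos + 1 != len(text):
            steps.append((n, "$ fails"))
        else:
            steps.append((n, "match"))
            return text[:pos + 1], steps
    return None, steps

lazy_match, lazy_steps = trace("abcabcabc", lazy=True)
greedy_match, greedy_steps = trace("abcabcabc", lazy=False)
print(lazy_match, len(lazy_steps))      # abcabcabc 8
print(greedy_match, len(greedy_steps))  # abcabcabc 2
```

Both orderings end on the same match. The lazy one simply takes more attempts to get there, because every shorter candidate fails on either c or $.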

Tip: Avoid patterns with two instances of a non-greediness modifier. Unless you are using them as an optimization, there's a good chance they can match something you don't want them to match. This is relevant here because patterns implicitly start with \G(?s:.*?)\K (unless cancelled out by a leading ^, \A or \G).
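The implicit prefix can be approximated in Python, with some caveats: Python's re module has neither \G nor \K, so this sketch anchors with re.match and uses a capture group in place of \K. It is an illustration of the idea, not a literal translation of what Perl does internally.

```python
import re

text = "abcabcabc"
pattern = r"a.*?c$"

# Unanchored search: the engine looks for the leftmost position that can match.
searched = re.search(pattern, text)

# The same thing spelled out: an explicit lazy "skip" prefix, anchored at the start.
# The capture group stands in for Perl's \K, which Python's re does not support.
explicit = re.match(r"(?s:.*?)(" + pattern + r")", text)

print(searched.group(0))  # abcabcabc
print(explicit.group(1))  # abcabcabc
```

Written this way, the pattern visibly contains two lazy quantifiers. The first is satisfied by skipping zero characters, which forces the second to stretch all the way to the end.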

What you want is one of the following:

/a[^a]*c$/
/a[^c]*c$/
/a[^ac]*c$/
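A quick check of the three patterns, again as a Python sketch under the assumption that re treats these character classes the same way Perl does. The second input, "xaac", is a made-up string used only to show where the variants differ.

```python
import re

patterns = (r"a[^a]*c$", r"a[^c]*c$", r"a[^ac]*c$")

# On the input from the question, all three pick the last "abc".
results = [re.search(p, "abcabcabc").group(0) for p in patterns]
print(results)  # ['abc', 'abc', 'abc']

# They differ on other inputs:
# [^a] refuses to cross another "a", while [^c] allows it.
print(re.search(patterns[0], "xaac").group(0))  # ac
print(re.search(patterns[1], "xaac").group(0))  # aac
```

Which one is right depends on whether an a or a c is allowed to appear between the two boundaries.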

You could also use one of the following:

/a(?:(?!a).)*c$/s
/a(?:(?!c).)*c$/s
/a(?:(?!a|c).)*c$/s

It would be inefficient and unreadable to use these latter three in this situation, but they will work with boundaries that are longer than one character.
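To illustrate the longer-boundary case, here is a sketch in Python with hypothetical two-character boundaries << and >>. The input string is invented for the example, and the negative lookahead is repeated with * so that it guards every character consumed between the boundaries.

```python
import re

# Hypothetical input with two-character boundaries "<<" and ">>".
text = "<<x>><<y>><<z>>"

# Lazy version: starts at the leftmost "<<" and stretches to the end.
lazy = re.search(r"<<.*?>>$", text, re.S)

# Lookahead version: each consumed character must not begin another "<<".
tempered = re.search(r"<<(?:(?!<<).)*>>$", text, re.S)

print(lazy.group(0))      # <<x>><<y>><<z>>
print(tempered.group(0))  # <<z>>
```

A character class cannot express "not followed by <<", which is why the lookahead form is needed once the boundary is longer than one character.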
