
RegExp exercise: reluctant quantifier with a lookahead assertion

Can you explain how this works? Here is an example:

<!-- The quick brown fox 
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" />
<![endif]-->

<!-- Pack my box with five dozen liquor jugs -->

First, I tried to use the following regular expression to match the content inside conditional comments:

/<!--.*?stylesheet.*?-->/s

It failed, because the match starts at the very first <!-- instead of at the conditional comment, so it includes content I do not want. Then I tried using another pattern with a lookahead assertion:

/<!--(?=.*?stylesheet).*?-->/s

It works and matches exactly what I need. However, the following regular expression works as well:

/<!--(?=.*stylesheet).*?-->/s

The last regular expression does not have a reluctant quantifier in the lookahead assertion, and now I am confused. Can anyone explain how it works? Maybe there is a better solution for this example?
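For reference, the behaviour of the first attempt (the one without a lookahead) can be reproduced with a short Python sketch. It assumes Python's `re` module with `re.DOTALL` standing in for the `/s` modifier; the variable names are illustrative only.

```python
import re

# Sample document from the question.
DOC = """<!-- The quick brown fox
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" />
<![endif]-->

<!-- Pack my box with five dozen liquor jugs -->
"""

# re.DOTALL plays the role of the /s modifier.
first_try = re.compile(r"<!--.*?stylesheet.*?-->", re.DOTALL)
matched = first_try.search(DOC).group(0)

# A lazy quantifier does not move the starting point: the engine still begins
# at the leftmost <!-- from which the whole pattern can succeed.
print(matched.startswith("<!-- The quick brown fox"))  # True
print(matched.endswith("<![endif]-->"))                # True
```

The reluctant `.*?` only keeps the right-hand end of the match short; it has no influence on where the match begins.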

Updated:

I tried using the regular expressions with the lookahead assertion on another document, and they failed to match the content between the comments. So this one, /<!--(?=.*?stylesheet).*?-->/s (as well as this one, /<!--(?=.*stylesheet).*?-->/s), is not correct. Do not use it; try the other suggestions.
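A minimal sketch of how such a failure can come about, using a hypothetical two-comment document (not the one I actually tested) and Python's `re` module with `re.DOTALL` in place of `/s`:

```python
import re

# Hypothetical "other document": a plain comment precedes the conditional one.
OTHER = (
    "<!-- plain note -->\n"
    '<!--[if IE 7]><link rel="stylesheet" href="/a.css" /><![endif]-->\n'
)

lazy = re.compile(r"<!--(?=.*?stylesheet).*?-->", re.DOTALL)
greedy = re.compile(r"<!--(?=.*stylesheet).*?-->", re.DOTALL)

lazy_match = lazy.search(OTHER).group(0)
greedy_match = greedy.search(OTHER).group(0)

# The lookahead only asserts that "stylesheet" occurs somewhere later in the
# subject; it does not require it to be inside the same comment.
print(lazy_match)    # <!-- plain note -->
print(greedy_match)  # <!-- plain note -->
```

Both patterns return the plain comment, because the lookahead consumes nothing and the following `.*?-->` simply stops at the first `-->`.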

Updated:

The solution has been found by Jonny 5 (see the answer). He suggested three options:

  1. Using a negated hyphen to limit the match. This option works only if there is no hyphen between <!-- and stylesheet . If a hyphenated URL such as /style-sheet.css appears there, it will not work.
  2. Using the escape sequence \K . It works like a charm. The downsides are the following:
    • It is terribly slow (in my case, it was 8-10 times slower than the other solutions)
    • It is only available since PHP 5.2.4
  3. Using a lookahead to narrow the match. This is the goal I tried to achieve, but my experience with lookaround assertions was insufficient to perform the task.

I think the following is a good solution for my example:

/(?s)<!--(?:(?!<!).)+?stylesheet.+?-->/

The same but with the s modifier at the end:

/<!--(?:(?!<!).)+?stylesheet.+?-->/s

As I said, this is a good solution, but I managed to improve the pattern and found another one that in my case works faster.

So, the final solution is the following:

/<!--(?:(?!-->).)+?stylesheet.+?-->/s
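Besides speed, the two guards can also behave differently. The sketch below is only an illustration under assumptions: it uses Python's `re` module with `re.DOTALL` instead of `/s`, and the `TEXT` string is a made-up input in which the word stylesheet sits between two comments.

```python
import re

# Hypothetical input: "stylesheet" appears outside of any comment.
TEXT = "<!-- a --> <p>stylesheet</p> <!-- b -->"

guard_open = re.compile(r"<!--(?:(?!<!).)+?stylesheet.+?-->", re.DOTALL)
guard_close = re.compile(r"<!--(?:(?!-->).)+?stylesheet.+?-->", re.DOTALL)

open_match = guard_open.search(TEXT)
close_match = guard_close.search(TEXT)

# (?!<!) only refuses to step onto the start of another <!...,
# so it can walk past the closing --> of the first comment.
print(open_match.group(0) == TEXT)  # True

# (?!-->) refuses to step onto a closing -->, so the scan cannot leave
# the comment it started in, and nothing matches here.
print(close_match)  # None
```

On the sample document from the question both patterns select the conditional comment; they differ only on inputs like the one above.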

Thanks to all the participants for the interesting answers.

The string stylesheet is mentioned only once in your test document, so both regular expressions you tried will match the same thing but in different ways.

/<!--(?=.*?stylesheet).*?-->/s

This one does the following:

  • Match <!-- .
  • Look ahead, lazily scanning forward until stylesheet is found. Fail if it is not found.
  • Starting again right after <!-- , match characters up to and including the first --> .

/<!--(?=.*stylesheet).*?-->/s

This one does the following:

  • Match <!-- .
  • Look ahead, greedily taking every character to the end of the subject, then backtrack one character at a time until stylesheet matches. Fail if it is never found.
  • Starting again right after <!-- , match characters up to and including the first --> .

Basically, the greedy version needs to backtrack significantly while the lazy one doesn't.

If your subject instead is...

<!-- The quick brown fox 
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" /> <![endif]-->

<!-- Pack my box with five dozen stylesheets -->

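the two lookaheads succeed in different places. The former finds the first stylesheet , while the latter finds the second (and last), since a greedy .* runs to the end of the string and backtracks from there. Because a lookahead consumes nothing, the text that the whole pattern matches is nevertheless the same in both cases.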
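To see where each lookahead lands, one can wrap its contents in a capture group. This is a sketch in Python (`re.DOTALL` in place of `/s`); the extra capture group and the variable names are additions for illustration, not part of the original patterns.

```python
import re

DOC2 = """<!-- The quick brown fox
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" /> <![endif]-->

<!-- Pack my box with five dozen stylesheets -->
"""

# Group 1 records what the lookahead scanned; it still consumes nothing.
lazy = re.compile(r"<!--(?=(.*?stylesheet)).*?-->", re.DOTALL)
greedy = re.compile(r"<!--(?=(.*stylesheet)).*?-->", re.DOTALL)

lm = lazy.search(DOC2)
gm = greedy.search(DOC2)

first_end = DOC2.index("stylesheet") + len("stylesheet")
last_end = DOC2.rindex("stylesheet") + len("stylesheet")

print(lm.end(1) == first_end)      # True: lazy stops at the first occurrence
print(gm.end(1) == last_end)       # True: greedy ends up at the last one
print(lm.group(0) == gm.group(0))  # True: the overall match is identical
```

In both cases the overall match is the first comment, since the part after the lookahead restarts right behind `<!--` and stops at the first `-->`.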

To match only the part <!-- ... stylesheet ... --> there are many ways:

1.) Use a negated hyphen [^-] to limit the match and stay in between <!-- and stylesheet

(?s)<!--[^-]+stylesheet.+?-->

[^-] allows only characters that are not a hyphen. See test at regex101.
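A small Python sketch of this option (`re.DOTALL` replaces the inline `(?s)`). The second input, `HYPHEN_FIRST`, is a hypothetical variation added to show the limitation: a hyphen between `<!--` and stylesheet prevents the match.

```python
import re

negated = re.compile(r"<!--[^-]+stylesheet.+?-->", re.DOTALL)

SAMPLE = """<!-- The quick brown fox jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" />
<![endif]-->
"""

# Hypothetical variation: a hyphenated URL comes before the word "stylesheet".
HYPHEN_FIRST = (
    '<!--[if IE 7]><link href="/style-sheet.css" rel="stylesheet" />'
    "<![endif]-->"
)

ok = negated.search(SAMPLE)
missed = negated.search(HYPHEN_FIRST)

# [^-]+ cannot cross the --> of the first comment, so the match
# starts at the conditional comment.
print(ok.group(0).startswith("<!--[if IE 7]>"))  # True

# [^-]+ stops at the hyphen in the URL, before "stylesheet" is reached.
print(missed)  # None
```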


2.) To get the "last" or closest match without much regex effort, you can also put a greedy .* in front to eat up everything before it. This makes sense if you are not matching globally and there is only one item to match. Use \K to reset the start of the match after the greedy part:

(?s)^.*\K<!--.+?stylesheet.+?-->

See test at regex101. You can also use a capture group and grab $1: (?s)^.*(<!--.+?stylesheet.+?-->)
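Python's built-in `re` module does not support `\K`, so the sketch below tries only the capture-group variant, with `re.DOTALL` standing in for `(?s)`. The names are illustrative.

```python
import re

DOC = """<!-- The quick brown fox
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" />
<![endif]-->

<!-- Pack my box with five dozen liquor jugs -->
"""

# The greedy ^.* first runs to the end of the subject and then gives back
# characters until the group can match, so the group starts at the last
# <!-- that is still followed by "stylesheet".
greedy_prefix = re.compile(r"^.*(<!--.+?stylesheet.+?-->)", re.DOTALL)
wanted = greedy_prefix.search(DOC).group(1)

print(wanted.startswith("<!--[if IE 7]>"))  # True
print(wanted.endswith("<![endif]-->"))      # True
```

Since the whole prefix is consumed by `^.*`, this approach yields at most one result per subject.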


3.) Using a lookahead to narrow it down is usually more costly:

(?s)<!--(?:(?!<!).)+?stylesheet.+?-->

See test at regex101. (?!<!). checks, at each character between <!-- and stylesheet , that no other <! starts there, so the match stays inside one element. This is similar to the negated hyphen solution.
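A sketch of this option in Python, run against the variant of the document in which the last comment also contains stylesheet (`re.DOTALL` replaces `(?s)`; variable names are illustrative):

```python
import re

DOC2 = """<!-- The quick brown fox
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" /> <![endif]-->

<!-- Pack my box with five dozen stylesheets -->
"""

tempered = re.compile(r"<!--(?:(?!<!).)+?stylesheet.+?-->", re.DOTALL)
found = [m.group(0) for m in tempered.finditer(DOC2)]

# The first comment is skipped: the scan stops before the next <! without
# having seen "stylesheet". The other two comments each match on their own.
print(len(found))                                  # 2
print(found[0].startswith("<!--[if IE 7]>"))       # True
print(found[1].startswith("<!-- Pack my box"))     # True
```

Unlike the greedy-prefix approach, this pattern can be used for global matching, at the price of one lookahead per character.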


Instead of .* I used .+ (one or more); which one to use depends on what should be matched. Here + fits better.
Which solution to use depends on the exact requirements. For this case I would use the first.
