Understanding RegEx that finds multiline HTML comments

I have a RegEx that I found at http://regexadvice.com/forums/thread/36397.aspx, and I am looking for an explanation of a behavior I don't understand. The RegEx is supposed to find multi-line HTML comments that are NOT inside script or style tags. I'm using it to build an app that strips browser-accessible comments post-build. For example, find this

<!-- I am an ordinary comment
and I need two lines -->

but not this

<script language="javascript1.2">
<!--
function window_Onload()
{   
    alert('I am the on load event');
}
window.onload=window_Onload;
//-->
</script>

Once found, I can remove the first comment chunk while ignoring the second.

The following pattern works absolutely beautifully to accomplish the above:

string multilinePattern = @"<!--((?!-->).)+-->(?>((?!</?(script|style)).)*)(?!</(script|style))";
match = Regex.Match(text, multilinePattern);
if (match.Success)
{
    output.WriteLine("{0}", match.Value);
}

This code gives me a file with all of the HTML comments that are NOT inside a script or style tag, but it also does something else I don't get.

Here's Example 1 of HTML and the return:

HTML:

<!-- Outside Table -->
<table summary="<%= GetLocalResourceObject("LayoutTable.SummaryText") %>" cellspacing="0" cellpadding="0" border="0" width="650" align="center">
    <tr>
        <td class="tableHeader">&nbsp;</td>

Returns:

<!-- Outside Table -->

Now, here's Example 2 of HTML and the return:

HTML:

<!-- Outside Table -->

<table  summary="<%= GetLocalResourceObject("LayoutTable.SummaryText") %>" class="tabTableCell"   cellpadding="0" cellspacing="0" width="750" align="center" >

    <tr>

        <td class="tableHeader">&nbsp;</td>

Returns:

<!-- Outside Table -->

<table  summary="<%= GetLocalResourceObject("LayoutTable.SummaryText") %>" class="tabTableCell"   cellpadding="0" cellspacing="0" width="750" align="center" >

    <tr>

Example 2 is the wrong one: I don't want that chunk of HTML included in the match result. But the only difference I can see between Examples 1 and 2 is the extra line break that follows the "Outside Table" comment in Example 2.

So my question is: what is it in the RegEx that causes the match to include the HTML all the way up to the TR tag in Example 2? What would I have to change to get the RegEx to match Example 2 the same way as Example 1?

OK, here is how it could be done with HtmlAgilityPack:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var comments = doc.DocumentNode
                .Descendants()
                .Where(d => d.Name == "#comment")
                .Select(d => d.InnerText)
                .ToList();

In my tests it matches just the comment in both cases. But if I specify the Singleline option (which you should be doing, since the comments can span lines), it matches the whole shebang in both cases. Could it be that you're doing the match in Singleline mode in your second test, but not the first?

But that's a bad regex anyway. After the comment is matched, the atomic group matches and consumes everything up to the next SCRIPT or STYLE tag (opening or closing), so all of that text becomes part of the match; only then does the lookahead assert that what follows is not a closing SCRIPT or STYLE tag.
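To make the effect of the consuming group concrete, here is a minimal sketch that runs the question's pattern with and without RegexOptions.Singleline. The class name ConsumingTailDemo, the helper FirstMatch, and the sample input are made up for illustration; the input only loosely imitates Example 2.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConsumingTailDemo
{
    // The pattern from the question, unchanged.
    public const string OriginalPattern =
        @"<!--((?!-->).)+-->(?>((?!</?(script|style)).)*)(?!</(script|style))";

    // Returns the text of the first match under the given options
    // (an empty string if there is no match).
    public static string FirstMatch(string text, RegexOptions options)
    {
        return Regex.Match(text, OriginalPattern, options).Value;
    }

    public static void Main()
    {
        // Hypothetical sample input, loosely based on Example 2.
        string html = "<!-- Outside Table -->\n\n<table>\n\n    <tr>";

        // Without Singleline, '.' stops at '\n', so the atomic group
        // consumes nothing after the comment.
        string plain = FirstMatch(html, RegexOptions.None);

        // With Singleline, '.' also matches '\n', so the atomic group
        // swallows everything up to the next script/style tag (here: the end).
        string single = FirstMatch(html, RegexOptions.Singleline);

        Console.WriteLine(plain == "<!-- Outside Table -->"); // True
        Console.WriteLine(single == html);                     // True
    }
}
```

The trailing HTML shows up in the match only when the dot is allowed to cross line breaks, which is consistent with the idea that the two tests were run under different conditions.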

You don't want to consume anything after the end of the comment; that should all be in one negative lookahead. For example:

(?inxs)
<!--((?!-->).)+-->
(?!
  ((?!</?(script|style)).)*
  </(script|style)
)

(?inxs) is an inline mode modifier; it switches on IgnoreCase, ExplicitCapture, IgnorePatternWhitespace, and Singleline modes. Here it is again, all on one line as a C# verbatim string (the x flag is dropped because the pattern no longer contains whitespace to ignore):

@"(?ins)<!--((?!-->).)+-->(?!((?!</?(script|style)).)*</(script|style))"
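As a rough sketch of how the one-line pattern could be used for the stripping step described in the question, the snippet below passes it to Regex.Replace. The class name CommentStripper, the method Strip, and the sample HTML are assumptions made for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // The one-line pattern from above: a comment that is NOT followed by a
    // closing script/style tag before any other script/style tag.
    private static readonly Regex CommentOutsideScript = new Regex(
        @"(?ins)<!--((?!-->).)+-->(?!((?!</?(script|style)).)*</(script|style))");

    // Removes every matching comment and leaves the rest of the text as-is.
    public static string Strip(string html)
    {
        return CommentOutsideScript.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // Hypothetical input: one ordinary two-line comment, then a script
        // block whose body is wrapped in an HTML comment.
        string html =
            "<!-- I am an ordinary comment\nand I need two lines -->\n" +
            "<script>\n<!--\nfunction f() {}\n//-->\n</script>\n";

        string expected = "\n<script>\n<!--\nfunction f() {}\n//-->\n</script>\n";
        Console.WriteLine(Strip(html) == expected); // True
    }
}
```

Because nothing after the closing --> is consumed, each match is exactly one comment, so replacing it with an empty string leaves the surrounding markup untouched.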
