
Understanding RegEx that finds multiline HTML comments

I found a RegEx at http://regexadvice.com/forums/thread/36397.aspx and I am looking for an explanation of a behavior that I don't understand. The RegEx is supposed to find multi-line HTML comments that are NOT inside script or style tags. I'm using it to build an app that can strip browser-accessible comments post-build. For example, it should find this:

<!-- I am an ordinary comment
and I need two lines -->

but not this:

<script language="javascript1.2">
<!--
function window_Onload()
{   
    alert('I am the on load event');
}
window.onload=window_Onload;
//-->
</script>

Once found, I can remove the first comment chunk while ignoring the second.

The following pattern works beautifully to accomplish the above:

string multilinePattern = @"<!--((?!-->).)+-->(?>((?!</?(script|style)).)*)(?!</(script|style))";
// text, match, and output are declared elsewhere in the app.
match = Regex.Match(text, multilinePattern);
if (match.Success)
{
    output.WriteLine("{0}", match.Value);
}

This code gives me a file with all of the HTML comments NOT inside a script or style tag, but it also does something I don't understand.

Here's Example 1 of HTML and the return:

HTML:

<!-- Outside Table -->
<table summary="<%= GetLocalResourceObject("LayoutTable.SummaryText") %>" cellspacing="0" cellpadding="0" border="0" width="650" align="center">
    <tr>
        <td class="tableHeader">&nbsp;</td>

Returns:

<!-- Outside Table -->

Now, here's Example 2 of HTML and the return:

HTML:

<!-- Outside Table -->

<table  summary="<%= GetLocalResourceObject("LayoutTable.SummaryText") %>" class="tabTableCell"   cellpadding="0" cellspacing="0" width="750" align="center" >

    <tr>

        <td class="tableHeader">&nbsp;</td>

Returns:

<!-- Outside Table -->

<table  summary="<%= GetLocalResourceObject("LayoutTable.SummaryText") %>" class="tabTableCell"   cellpadding="0" cellspacing="0" width="750" align="center" >

    <tr>

Example 2 is the wrong one: I don't want to include that chunk of HTML in the match result. But the only difference I can see between Examples 1 and 2 is the extra line break that follows the "Outside Table" comment in Example 2.

So my question is: what is it in the Regex that's causing the match to include the HTML all the way up to the TR tag in Example 2? What would I have to change to get the Regex to match Example 2 the same way as Example 1?

OK, here is how it could be done with HtmlAgilityPack:

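// Requires the HtmlAgilityPack package; Where/Select come from System.Linq.
using System.Linq;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Collect the text of every comment node in the document.
var comments = doc.DocumentNode
                .Descendants()
                .Where(d => d.Name == "#comment")
                .Select(d => d.InnerText)
                .ToList();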

In my tests it matches just the comment in both cases. But if I specify the Singleline option (which you should be doing), it matches the whole shebang in both cases. Could it be that you're doing the match in Singleline mode in your second test, but not the first?

But that's a bad regex anyway. After the comment is matched, the atomic group matches and consumes anything that's not a SCRIPT or STYLE tag (opening or closing), and then the lookahead asserts that what follows is not a closing SCRIPT or STYLE tag.
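To see the effect of Singleline mode and of the consuming atomic group, here is a minimal sketch that runs the question's pattern twice on the same input. The sample HTML is a hypothetical, shortened stand-in for the question's examples and assumes LF-only line endings; it is not the asker's actual file.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
string pattern = @"<!--((?!-->).)+-->(?>((?!</?(script|style)).)*)(?!</(script|style))";

// Hypothetical sample input (assumption: LF-only line endings).
string html = "<!-- Outside Table -->\n<table>\n    <tr>";

// Default mode: '.' does not match '\n', so the atomic group
// consumes nothing after the comment.
string defaultMatch = Regex.Match(html, pattern).Value;

// Singleline mode: '.' also matches '\n', so the atomic group
// consumes everything up to the next SCRIPT/STYLE tag (here, the end of input).
string singlelineMatch = Regex.Match(html, pattern, RegexOptions.Singleline).Value;

Console.WriteLine(defaultMatch == "<!-- Outside Table -->"); // comment only
Console.WriteLine(singlelineMatch == html);                  // comment plus trailing markup
```

With this input, the default-mode match is just the comment, while the Singleline match runs to the end of the string, because whatever the atomic group consumes becomes part of the match value.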

You don't want to consume anything after the end of the comment; that should all be in one negative lookahead. For example:

(?inxs)
<!--((?!-->).)+-->
(?!
  ((?!</?(script|style)).)*
  </(script|style)
)

(?inxs) is an inline mode modifier; it switches on IgnoreCase, ExplicitCapture, IgnorePatternWhitespace, and Singleline modes. Here it is again, all on one line as a C# verbatim string:

@"(?ins)<!--((?!-->).)+-->(?!((?!</?(script|style)).)*</(script|style))"
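As a rough illustration of how the one-line pattern could be used for the stripping step, the sketch below passes it to Regex.Replace. The sample HTML is a hypothetical, shortened version of the question's two cases, with LF-only line endings assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// The one-line pattern from the answer above, unchanged.
string pattern = @"(?ins)<!--((?!-->).)+-->(?!((?!</?(script|style)).)*</(script|style))";

// Hypothetical sample: one ordinary comment, one comment wrapped in a script block.
string html =
    "<!-- I am an ordinary comment\nand I need two lines -->\n" +
    "<p>text</p>\n" +
    "<script>\n<!--\nalert('x');\n//-->\n</script>\n";

// Remove every match. Nothing after the comment is consumed,
// because the trailing condition sits entirely inside a lookahead.
string stripped = Regex.Replace(html, pattern, "");

Console.WriteLine(stripped);
```

On this input the ordinary comment is removed, while the comment inside the script block is left alone: the lookahead reaches a closing script tag before any other SCRIPT or STYLE tag, so the match is rejected there.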
