
Problems with named capturing in C# regex

I've been struggling with this for a while:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?<tag>[^\s/>]+)(?<innerHtml>.*)(?<closeTag>[^\s>]+)>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

string tag = matches[0].Groups["tag"].Value; // "h2"
string innerHtml = matches[0].Groups["innerHtml"].Value; // ">hello world</h"
string closeTag = matches[0].Groups["closeTag"].Value; // "2"

As can be seen, tag works as expected, while innerHtml and closeTag do not. Any advice? Thanks.

Update

The input string may vary. Here is another scenario: "<div class='myclass'><h2>hello world</h2></div>"

Try matching the > and </ outside of the capture groups, like this:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?<tag>[^\s/>]+)>(?<innerHtml>.*)</(?<closeTag>[^\s>]+)>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

Update: a more specific example that should be a little more flexible:

var matches = Regex.Matches(
    "<div class='myclass'><h2>hello world</h2></div>",
    @"<(?<tag>[^\s>]+)               #Opening tag
        \s*(?<attributes>[^>]*)\s*>  #Attributes inside tag (optional)
      (?<innerHtml>.*)               #Inner Html
      </(?<closeTag>\1)>             #Closing tag, must match opening tag",
    RegexOptions.IgnoreCase | 
    RegexOptions.Compiled | 
    RegexOptions.Multiline |
    RegexOptions.IgnorePatternWhitespace);

string tag = matches[0].Groups["tag"].Value;             // "div"
string attr = matches[0].Groups["attributes"].Value;     // "class='myclass'"
string innerHtml = matches[0].Groups["innerHtml"].Value; // "<h2>hello world</h2>"
string closeTag = matches[0].Groups["closeTag"].Value;   // "div"
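The \1 in the pattern above points at the tag group only because every group in that pattern is named, so .NET numbers them left to right. The following is a minimal sketch of that numbering rule, not part of the original answer; the class GroupNumberingDemo and the helper TagGroupNumber are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNumberingDemo
{
    // Hypothetical helper: reports which group number the name "tag" receives in a pattern.
    public static int TagGroupNumber(string pattern)
    {
        return new Regex(pattern).GroupNumberFromName("tag");
    }

    public static void Main()
    {
        // All groups are named, so they are numbered left to right and "tag" is group 1.
        Console.WriteLine(TagGroupNumber(@"<(?<tag>\w+)(?<attributes>[^>]*)>"));

        // An unnamed group is numbered before any named group, so "tag" becomes group 2
        // and \1 would refer to the unnamed group instead.
        Console.WriteLine(TagGroupNumber(@"<(?<tag>\w+)(\s[^>]*)?>"));

        // \k<tag> refers to the group by name, so it is unaffected by the numbering.
        bool ok = Regex.IsMatch("<b>x</b>", @"<(?<tag>\w+)(\s[^>]*)?>.*</\k<tag>>");
        Console.WriteLine(ok);
    }
}
```

If the pattern is likely to gain unnamed groups later, a named backreference such as \k<tag> is probably the safer choice.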

You want the Singleline option, not Multiline. Singleline enables . to match linefeeds, while Multiline changes the behavior of the anchors (^ and $), which you aren't using.
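To make the difference concrete, here is a small sketch that runs one pattern against input containing a line break under each option. The input string, the class SinglelineDemo, and the helper TryGetInnerHtml are assumptions made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    private const string Pattern =
        @"<(?<tag>[^\s/>]+)>(?<innerHtml>.*)</(?<closeTag>[^\s>]+)>";

    // Hypothetical helper: true when the pattern matches; innerHtml receives the capture.
    public static bool TryGetInnerHtml(string input, RegexOptions options, out string innerHtml)
    {
        Match m = Regex.Match(input, Pattern, options);
        innerHtml = m.Success ? m.Groups["innerHtml"].Value : "";
        return m.Success;
    }

    public static void Main()
    {
        string input = "<h2>hello\nworld</h2>";
        string inner;

        // Multiline only changes ^ and $, so '.' still stops at '\n' and nothing matches.
        Console.WriteLine(TryGetInnerHtml(input, RegexOptions.Multiline, out inner));

        // Singleline lets '.' cross the line break, so the whole element matches.
        Console.WriteLine(TryGetInnerHtml(input, RegexOptions.Singleline, out inner));
    }
}
```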

Also, if you want the closing tag to have the same name as the opening tag, you should use a backreference. Here I've used '' as the name delimiters instead of <> to reduce confusion:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?'tag'[^/>]+)>(?'innerHtml'.*)</\k'tag'>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
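As a rough illustration of what the backreference buys you, the sketch below compares a pattern that accepts any closing tag with one that requires the opening name to repeat, using the nested input from the question. The names BackreferenceDemo, Loose, Strict, and Describe are hypothetical, and the patterns are simplified variants for this demonstration rather than the exact ones above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // The closing tag may be any name (no backreference).
    public const string Loose =
        @"<(?<tag>[^\s/>]+)>(?<innerHtml>.*)</(?<closeTag>[^\s>]+)>";

    // The closing tag must repeat the opening tag's name.
    public const string Strict =
        @"<(?<tag>[^\s/>]+)>(?<innerHtml>.*)</\k<tag>>";

    // Hypothetical helper: "tag|innerHtml" for the first match, or "" when nothing matches.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success
            ? m.Groups["tag"].Value + "|" + m.Groups["innerHtml"].Value
            : "";
    }

    public static void Main()
    {
        string input = "<div class='myclass'><h2>hello world</h2></div>";

        // Loose: the greedy .* runs to the last "</", pairing <h2> with </div>.
        Console.WriteLine(Describe(input, Loose));   // h2|hello world</h2>

        // Strict: the backreference forces the match to end at </h2>.
        Console.WriteLine(Describe(input, Strict));  // h2|hello world
    }
}
```

Neither pattern handles the attributes on the div here, so both start matching at the h2 element. The earlier pattern with an attributes group covers that case.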

And you don't need the Compiled option. All it does is make it more expensive to create the Regex object, for an increase in performance that you almost certainly don't need and won't notice.
