Replace matching elements, including nested ones

Question

I need to replace all occurrences of a span with id="comment_n", where n can be any number, and any such span can contain nested ones. Each span can have different attributes. Example:

foo <span id="comment_1">text <span id="comment_2" attr="value">text.</span></span> bar

I have this regular expression:

<span id="comment_\d+.+?<\/span>

But it doesn't include the last closing span tag.

I need to do a replace:

Regex.Replace(input, regex, string.Empty, RegexOptions.Multiline | RegexOptions.IgnoreCase);

Demo: http://regexr.com/3bpkf

Answer 1

I suggest using HtmlAgilityPack for this. You can use an XPath expression to select only the <span> tags whose id attribute starts with comment_ (case-insensitively) and then remove them. The additional check for the number after comment_ can be done with or without a regex. Here is a method that removes tags with a specific attribute, where the attribute value is checked with a regex.

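// Requires the HtmlAgilityPack NuGet package, plus:
// using System; using System.Text.RegularExpressions;
public string HtmlAgilityPackRemoveTagsWithSpecificAttribute(string html, string xpath, string attribute_name, Regex rx)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) &&
        (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // SelectNodes returns null (not an empty collection) when nothing matches
    var nodes = hap.DocumentNode.SelectNodes(xpath);
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // A missing attribute is treated as an empty value instead of throwing.
            // Removing a node also removes everything nested inside it.
            if (rx.IsMatch(node.GetAttributeValue(attribute_name, string.Empty)))
                node.ParentNode.RemoveChild(node);
        }
    }
    return hap.DocumentNode.OuterHtml;
}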

You can use it like this:

var res = HtmlAgilityPackRemoveTagsWithSpecificAttribute(html,
  "//span[starts-with(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
           'abcdefghijklmnopqrstuvwxyz'), 'comment_')]", "id", 
                new Regex("(?i)^comment_[0-9]+$"));

Note that translate is used to make the comparison case-insensitive (comment_, COMMENT_, etc.). If you do not need that, just use "//span[starts-with(@id, 'comment_')]".

If you call the method more than once, instantiate the Regex once and pass the same instance each time. Alternatively, change the method signature to take the pattern as a string and call the static Regex.IsMatch inside the method.

Answer 2

As to why it doesn't include the last closing span tag: it's because of the ? after .+ in your regex pattern. That makes the quantifier "lazy", so it matches the shortest string that satisfies the pattern and stops at the first </span>. If you remove the ?, the quantifier becomes greedy and the match will include the last closing span tag:

<span id="comment_\d+.+<\/span>

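Note, however, that the greedy version runs to the last </span> it can reach, so it can also swallow unrelated spans that follow the comment. I'd therefore suggest using HtmlAgilityPack to parse your DOM and manipulate it.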
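To make the lazy/greedy difference concrete, here is a minimal sketch that runs both patterns over the question's sample input and over a second, made-up input containing an unrelated span. The class name SpanRegexDemo, the helper Strip, and the Balanced pattern are assumptions added for illustration and are not part of either answer. Balanced uses .NET balancing groups to count opening and closing span tags, which is one possible regex-only approach; it still assumes well-formed markup and double-quoted id attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanRegexDemo
{
    // Lazy quantifier from the question: stops at the first </span>.
    public const string Lazy = @"<span id=""comment_\d+.+?</span>";

    // Greedy quantifier from this answer: runs to the last </span> it can reach.
    public const string Greedy = @"<span id=""comment_\d+.+</span>";

    // Assumption (not from the answers): a balancing-group pattern.
    // Each nested <span ...> pushes onto the "open" stack, each </span> pops it,
    // and the final </span> is only accepted once the stack is empty.
    public const string Balanced =
        @"<span\b[^>]*\bid=""comment_\d+""[^>]*>" +
        @"(?>(?!</?span\b).|(?<open><span\b[^>]*>)|(?<-open></span>))*" +
        @"(?(open)(?!))</span>";

    // Removes every match of the pattern from the input.
    // Singleline lets '.' match newlines so a span may cover several lines.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Sample input from the question.
        string nested = "foo <span id=\"comment_1\">text <span id=\"comment_2\" attr=\"value\">text.</span></span> bar";
        // Hypothetical input: a comment span followed by an unrelated span.
        string mixed = "a <span id=\"comment_1\">x</span> keep <span>y</span> b";

        Console.WriteLine(Strip(nested, Lazy));     // foo </span> bar
        Console.WriteLine(Strip(nested, Greedy));   // foo  bar
        Console.WriteLine(Strip(mixed, Greedy));    // a  b   ("keep" is lost)
        Console.WriteLine(Strip(nested, Balanced)); // foo  bar
        Console.WriteLine(Strip(mixed, Balanced));  // a  keep <span>y</span> b
    }
}
```

The greedy pattern happens to work on the question's sample only because nothing follows the outer span on that line. The balanced pattern avoids that, but an HTML parser remains the more robust choice for real documents.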


Answer 1
Score 2, accepted, 2015-09-15 08:15:43

Answer 2
Score -1, 2015-09-15 07:34:29
