Replace matching elements including nested ones

I need to replace all occurrences of a span with id="comment_n", where n can be any number. Any such span can contain nested ones, and each span can have different attributes. Example:

foo <span id="comment_1">text <span id="comment_2" attr="value">text.</span></span> bar

I have this regular expression:

<span id="comment_\d+.+?<\/span>

But it doesn't include the last closing span tag.

I need to do a replace:

Regex.Replace(input, regex, string.Empty, RegexOptions.Multiline | RegexOptions.IgnoreCase);

Demo: http://regexr.com/3bpkf

I suggest using HtmlAgilityPack for this. You can specify an XPath that selects only the <span> tags whose id attribute starts with comment_ (case-insensitive) and then remove them. The additional check for the number after comment_ can be done with or without a regex. Here is a way to remove tags that have a specific attribute, where the attribute value is checked with a regex.

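// Requires the HtmlAgilityPack NuGet package, plus:
// using System; using System.Text.RegularExpressions;
public string HtmlAgilityPackRemoveTagsWithSpecificAttribute(string html, string xpath, string attribute_name, Regex rx)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) &&
        (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var doc = new HtmlAgilityPack.HtmlWeb();
        hap = doc.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // SelectNodes returns null (not an empty collection) when nothing matches
    var nodes = hap.DocumentNode.SelectNodes(xpath);
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // A missing attribute yields an empty string instead of a null reference.
            // Removing a node also removes everything nested inside it.
            if (rx.IsMatch(node.GetAttributeValue(attribute_name, string.Empty)))
                node.ParentNode.RemoveChild(node);
        }
    }
    return hap.DocumentNode.OuterHtml;
}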

You can use it like this:

var res = HtmlAgilityPackRemoveTagsWithSpecificAttribute(html,
  "//span[starts-with(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
           'abcdefghijklmnopqrstuvwxyz'), 'comment_')]", "id", 
                new Regex("(?i)^comment_[0-9]+$"));

Note that translate is used to enable case-insensitive comparison (comment_, COMMENT_, etc.). If you do not need that, just use "//span[starts-with(@id, 'comment_')]".

If you use the regex more than once, instantiate it once before passing it to the method. Alternatively, call the static Regex.IsMatch inside the method and change the method signature to take a pattern string instead of a Regex.

As to why it doesn't include the last closing span tag: it's because of the ? in your regex pattern. It makes the .+ quantifier lazy, so it matches the shortest satisfying string and stops at the first </span>. If you remove the ?, the match will include the last span tag:

<span id="comment_\d+.+<\/span>
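To make the difference concrete, here is a minimal sketch that runs both patterns over the example from the question. The class name SpanRegexDemo, the Strip helper, and the second input string with two sibling spans are made up for illustration and are not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanRegexDemo
{
    // Lazy quantifier (.+?): stops at the first closing tag it finds.
    public const string LazyPattern = @"<span id=""comment_\d+.+?</span>";
    // Greedy quantifier (.+): runs to the last closing tag on the line.
    public const string GreedyPattern = @"<span id=""comment_\d+.+</span>";

    // Hypothetical helper: removes every match of the pattern from the input.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, string.Empty, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string nested = "foo <span id=\"comment_1\">text <span id=\"comment_2\" attr=\"value\">text.</span></span> bar";
        // Illustrative input with two separate comment spans on one line.
        string siblings = "a <span id=\"comment_1\">x</span> keep <span id=\"comment_2\">y</span> b";

        Console.WriteLine(Strip(nested, LazyPattern));     // foo </span> bar
        Console.WriteLine(Strip(nested, GreedyPattern));   // foo  bar
        Console.WriteLine(Strip(siblings, GreedyPattern)); // a  b  ("keep" is lost)
    }
}
```

The last call shows the cost of the greedy version: with two separate comment spans on the same line, it also swallows the text between them. Also note that . does not match a newline unless RegexOptions.Singleline is set; RegexOptions.Multiline only changes the meaning of ^ and $.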

But I'd suggest using HtmlAgilityPack for parsing your DOM and manipulating it.
