Replace matching elements including nested ones
I need to replace all occurrences of span elements having id="comment_n", where n can be any number, and any of these qualifying spans can contain nested ones. Each span can have different attributes. Example:
foo <span id="comment_1">text <span id="comment_2" attr="value">text.</span></span> bar
I have this regular expression:
<span id="comment_\d+.+?<\/span>
But it doesn't include the last closing span tag.
I need to do a replace:
Regex.Replace(input, regex, string.Empty, RegexOptions.Multiline | RegexOptions.IgnoreCase);
Demo: http://regexr.com/3bpkf
I suggest using HtmlAgilityPack to obtain what you need. You can write an XPath expression that selects only the <span> tags whose id attribute starts with comment_ (case-insensitively) and then remove them. The additional check for the number after comment_ can be done with a regex, or without. Here is a way to remove tags that have a specific attribute whose value is checked with a regex:
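// Requires: using System; using System.Text.RegularExpressions; plus the HtmlAgilityPack package.
public string HtmlAgilityPackRemoveTagsWithSpecificAttribute(string html, string xpath, string attribute_name, Regex rx)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) &&
        (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // SelectNodes returns null (not an empty collection) when nothing matches
    var nodes = hap.DocumentNode.SelectNodes(xpath);
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // GetAttributeValue falls back to "" if the selected node lacks the attribute
            if (rx.IsMatch(node.GetAttributeValue(attribute_name, string.Empty)))
                node.ParentNode.RemoveChild(node);
        }
    }
    return hap.DocumentNode.OuterHtml;
}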
You can use it like this:
var res = HtmlAgilityPackRemoveTagsWithSpecificAttribute(html,
    "//span[starts-with(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', " +
    "'abcdefghijklmnopqrstuvwxyz'), 'comment_')]", "id",
    new Regex("(?i)^comment_[0-9]+$"));
Note that translate is used to enable case-insensitive comparison (comment_, COMMENT_, etc.). If you do not need that, just use "//span[starts-with(@id, 'comment_')]" as the XPath.
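As a rough sketch of the simpler, case-sensitive XPath combined with the "without a regex" option for checking the number, something like the following could work. The class and method names (CommentSpanRemover, RemoveCommentSpans) are made up for illustration, and the code assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class CommentSpanRemover
{
    private const string Prefix = "comment_";

    // Hypothetical helper: removes every <span> whose id is "comment_"
    // followed by one or more ASCII digits, nested spans included.
    public static string RemoveCommentSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Case-sensitive prefix test done by XPath itself
        var nodes = doc.DocumentNode.SelectNodes("//span[starts-with(@id, 'comment_')]");
        if (nodes == null)
            return html; // nothing matched, return the input unchanged

        foreach (var node in nodes)
        {
            // The XPath guarantees the id starts with the prefix, so Substring is safe
            string suffix = node.GetAttributeValue("id", string.Empty).Substring(Prefix.Length);

            // Digit check without a regex
            if (suffix.Length > 0 && suffix.All(c => c >= '0' && c <= '9'))
                node.ParentNode.RemoveChild(node);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the outer span is removed together with its whole subtree, nested comment spans disappear with it. For the example input from the question, this should leave only the surrounding "foo" and "bar" text.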
If you use the regex more than once, you can instantiate it once before passing it to the method. Alternatively, use the static Regex.IsMatch and change the method signature accordingly.
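A possible shape for the static Regex.IsMatch variant is sketched below. It takes the pattern as a string instead of a Regex instance. The names TagRemover and RemoveTags are illustrative only, URL loading is left out for brevity, and HtmlAgilityPack is assumed to be referenced.

```csharp
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class TagRemover
{
    // Hypothetical variant: the caller passes a pattern string, and the
    // static Regex.IsMatch overload is used instead of a Regex instance.
    public static string RemoveTags(string html, string xpath, string attributeName, string pattern)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                string value = node.GetAttributeValue(attributeName, string.Empty);
                if (Regex.IsMatch(value, pattern, RegexOptions.IgnoreCase))
                    node.ParentNode.RemoveChild(node);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

A call might look like TagRemover.RemoveTags(html, "//span[starts-with(@id, 'comment_')]", "id", "^comment_[0-9]+$"). The static Regex methods keep a small internal cache of recently used patterns, so repeated calls with the same pattern do not necessarily re-parse it each time.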
As to why it doesn't include the last closing span tag: it's because of the ? in your regex pattern. It makes the quantifier "lazy", so it matches the shortest satisfying string. If you remove it, the match will include the last span tag:
<span id="comment_\d+.+<\/span>
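To make the difference concrete, here is a small sketch that runs both patterns through Regex.Replace with the same options as in the question. The class and method names are invented for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    private const RegexOptions Options = RegexOptions.Multiline | RegexOptions.IgnoreCase;

    // Lazy ".+?": stops at the first </span> it can reach
    public static string RemoveLazy(string input) =>
        Regex.Replace(input, @"<span id=""comment_\d+.+?<\/span>", string.Empty, Options);

    // Greedy ".+": runs to the last </span> on the same line
    public static string RemoveGreedy(string input) =>
        Regex.Replace(input, @"<span id=""comment_\d+.+<\/span>", string.Empty, Options);
}
```

For the example input from the question, RemoveLazy returns "foo </span> bar" (the outer closing tag is left behind), while RemoveGreedy returns "foo  bar". The greedy form has its own limit, though: given two separate comment spans on one line, such as a <span id="comment_1">x</span> keep <span id="comment_2">y</span> b, it removes everything from the first opening tag to the last closing tag, including the text "keep" in between.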
But I'd suggest using HtmlAgilityPack for parsing your DOM and manipulating it.