
Strip all Links From HTML String - Regex?

I have a string which is basically a block of content with normal formatting (p tags, bold, etc.) that sometimes contains HTML links editors have put in.

I want to keep all the other HTML and strip out only the links. I'm not sure what the fastest and most efficient way of doing this is, as the string can be large (they are articles).

Any code sample greatly appreciated :)

Not very accurate, but a lazy approach would be to replace "<a " with "<span " and "</a>" with "</span>". A more accurate result would be to parse it in a DOM:

using System.Net;
using HtmlAgilityPack;

string html;
using (var client = new WebClient()) {
    // sample input; in practice this would be the article string
    html = client.DownloadString("http://stackoverflow.com");
}
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
HtmlNode node;
// loop this way to avoid issues with nesting, mutating the set, etc
while((node = doc.DocumentNode.SelectSingleNode("//a")) != null) {
    // copy the link's contents into a span, then swap the span in for the link
    var span = doc.CreateElement("span");
    span.InnerHtml = node.InnerHtml;
    node.ParentNode.InsertAfter(span, node);
    node.Remove();
}
string final = doc.DocumentNode.OuterHtml;

Note, however, that removing the link tags may change the styling, for example if there is a CSS rule of the form a.someClass { ... } or a someNested { ... }

Note on the code above; you could also try the more direct:

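// by default SelectNodes returns null (not an empty collection) when nothing matches
var links = doc.DocumentNode.SelectNodes("//a");
if (links != null) {
    foreach(var node in links) {
        var span = doc.CreateElement("span");
        span.InnerHtml = node.InnerHtml;
        node.ParentNode.InsertAfter(span, node);
        node.Remove();
    }
}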

but I wasn't sure if this might cause issues with mutation/iteration for some nesting constructs...
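
For comparison, the lazy string-based approach mentioned at the start of this answer could be sketched with a regular expression, since that is what the question title asks about. This is only a rough sketch under a few assumptions: the names LinkStripper and StripLinks are made up for illustration, it removes the anchor tags outright rather than turning them into spans, and it will misbehave on markup where an attribute value contains a ">" character, or on anchor-like text inside comments or scripts. For anything beyond fairly regular editor-generated HTML, the DOM approach above is likely the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkStripper
{
    // Matches an opening <a ...> tag or a closing </a> tag, case-insensitively.
    // Requiring whitespace (or nothing) after the "a" keeps tags such as
    // <abbr> or <article> from matching.
    private static readonly Regex AnchorTag = new Regex(
        @"</?a(?:\s[^>]*)?>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Removes the anchor tags but keeps whatever text or markup they wrapped.
    public static string StripLinks(string html)
    {
        return AnchorTag.Replace(html, string.Empty);
    }
}
```

Calling LinkStripper.StripLinks on `<p>See <a href="http://example.com">this <b>page</b></a>.</p>` should leave `<p>See this <b>page</b>.</p>`. The regex is compiled once and reused, which may help a little when many large articles are processed, though that has not been measured here.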


 