
HTML Parsing and removing anchor tags while preserving inner html using Jsoup

I have to parse some HTML and remove the anchor tags, but I need to preserve the inner HTML of those anchor tags.

For example, if my html text is:

String inputHtml = "<div> <p> some text <a href=\"#\"> some link text </a> </p> </div>";

I can parse the above HTML and select the anchor tags in jsoup like this:

Document doc = Jsoup.parse(inputHtml);

// this selects all anchor (<a>) elements in the document
Elements elements = doc.select("a");

and I can remove all of them with:

elements.remove();

But that removes each anchor element completely, from the opening tag to the closing tag, so the inner HTML is lost. How can I preserve the inner HTML while removing only the opening and closing tags?

Also, please note: I know there are methods to get the outer and inner HTML of an element (outerHtml() and html() in jsoup), but those only retrieve the markup as a string, while remove() deletes the whole element. Is there any way to remove only the outer tags and preserve the inner HTML?

Thanks a lot in advance and appreciate your help.

--Rajesh

Use unwrap(). It removes the matched elements from the DOM but keeps their children in place, so the inner HTML is preserved:

doc.select("a").unwrap();

Check the API docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29
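
For anyone who wants to try this end to end, here is a minimal sketch of the unwrap() approach. The class name AnchorUnwrapDemo and the helper stripAnchors are made up for illustration, and it assumes jsoup is on the classpath. Pretty-printing is switched off only so the output stays on one line.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AnchorUnwrapDemo {

    // Hypothetical helper: returns the body HTML with every <a> tag removed
    // but its children (text and nested elements) kept in place.
    static String stripAnchors(String inputHtml) {
        Document doc = Jsoup.parseBodyFragment(inputHtml);
        // Keep the serialized output compact instead of indented
        doc.outputSettings().prettyPrint(false);
        // unwrap() drops each matched element and re-parents its children
        doc.select("a").unwrap();
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div> <p> some text <a href=\"#\"> some <b>link</b> text </a> </p> </div>";
        System.out.println(stripAnchors(html));
    }
}
```

Because the children are moved rather than copied as text, nested markup such as the <b> element above should survive untouched.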

How about extracting the inner HTML first, adding it to the DOM and then removing your tags? This code is untested, but should do the trick:

Edit:

I updated the code to use replaceWith(), which makes it more intuitive and probably more efficient; thanks to AJ's hint in the comments.

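import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
for (Element link : links) {
    // text() rather than html(): a text node stores plain text, so markup would be escaped
    // (older jsoup releases used the two-argument constructor TextNode(text, baseUri))
    Node linkText = new TextNode(link.text());
    // optionally wrap the inner HTML in a tag instead, which keeps nested markup:
    // Element linkText = doc.createElement("span");
    // linkText.html(link.html());
    link.replaceWith(linkText);
}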

Instead of using a text node, you can wrap the inner HTML in any element you want. You will have to do that if your links contain more than plain text, because a text node holds only text and nested markup would not survive as markup.
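
To illustrate the wrapping variant, here is a small sketch that replaces each link with a span holding the link's inner HTML. The class name AnchorToSpanDemo and the method replaceAnchorsWithSpans are invented for this example, and the choice of span as the wrapper is just an assumption; any element would do.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AnchorToSpanDemo {

    // Hypothetical helper: swaps every <a> for a <span> with the same inner HTML.
    static String replaceAnchorsWithSpans(String inputHtml) {
        Document doc = Jsoup.parseBodyFragment(inputHtml);
        doc.outputSettings().prettyPrint(false);
        for (Element link : doc.select("a")) {
            Element wrapper = doc.createElement("span");
            // html(String) parses the markup, so nested elements stay elements
            wrapper.html(link.html());
            link.replaceWith(wrapper);
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(replaceAnchorsWithSpans("<p>see <a href=\"#\"><b>bold</b> link</a></p>"));
    }
}
```

Unlike unwrap(), this leaves an extra wrapper element behind, which may be useful if the former links still need a styling hook.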

