HTML parsing and removing anchor tags while preserving inner HTML using Jsoup

I have to parse some HTML and remove the anchor tags, but I need to preserve the inner HTML of those anchor tags.

For example, if my HTML text is:

String html = "<div> <p> some text <a href=\"#\"> some link text </a> </p> </div>";

Now I can parse the above HTML and select the a tags in jsoup like this:

Document doc = Jsoup.parse(html);

// selects every <a> element in the document
Elements elements = doc.select("a");

and I can remove all of them with:

elements.remove();

But that removes the complete anchor tag, from the opening bracket to the closing bracket, and the inner HTML is lost. How can I preserve the inner HTML while removing only the start and close tags?

Also, please note: I know there are methods to get the outer and inner HTML of an element (outerHtml() and html() in jsoup), but those only retrieve the markup as text, while remove() deletes the complete HTML of the tag. Is there any way to remove only the outer tags and preserve the inner HTML?

Thanks a lot in advance; I appreciate your help.

--Rajesh

Use unwrap(); it removes the matched elements but keeps their inner HTML in place.

doc.select("a").unwrap();

Check the API docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29
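
For a fuller picture, here is a minimal sketch of the unwrap() approach as a complete class. It assumes jsoup is on the classpath; the class name UnwrapLinks and the helper stripAnchors are made up for illustration, and the sample input adds a nested <b> tag to show that markup inside the link survives.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UnwrapLinks {
    // Hypothetical helper: drops every <a> element but leaves its children
    // (text and nested tags) where they were.
    static String stripAnchors(String html) {
        // parseBodyFragment places the fragment inside <body>,
        // so body().html() returns just our markup.
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("a").unwrap();
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div><p>some text <a href=\"#\">some <b>link</b> text</a></p></div>";
        System.out.println(stripAnchors(html));
    }
}
```

Note that jsoup pretty-prints by default, so the whitespace of the result may differ from the input even though the structure is the same apart from the removed anchors.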

How about extracting the inner HTML first, adding it to the DOM, and then removing your tags? This code is untested, but should do the trick.

Edit:

I updated the code to use replaceWith(), which makes it more intuitive and probably more efficient; thanks to AJ's hint in the comments.

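import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
for (Element link : links) {
    // Current jsoup releases take only the text; older ones also took a base URI.
    // A TextNode escapes its content, so nested markup would show up as literal text.
    Node linkText = new TextNode(link.html());
    // optionally wrap it in a tag instead, which keeps nested markup intact:
    // Element linkText = doc.createElement("span");
    // linkText.html(link.html());
    link.replaceWith(linkText);
}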

Instead of using a text node, you can wrap the inner HTML in anything you want; you might even have to, if there is more than plain text inside your links.
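
The idea behind this answer (put the link's contents into the DOM, then drop the link itself) is not specific to jsoup. As a dependency-free sketch, the same steps can be written against the JDK's built-in org.w3c.dom API. This version moves the child nodes instead of re-parsing an HTML string, so nested markup inside the link is kept as real elements. The names ManualUnwrap and unwrapTag are hypothetical, and unlike jsoup this parser only accepts well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ManualUnwrap {
    // Hypothetical helper: removes every <tagName> element but keeps its children.
    // Assumes well-formed XML input and that tagName is never the root element.
    static String unwrapTag(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // getElementsByTagName returns a live list: each removal shrinks it,
            // so keep taking item(0) until nothing is left.
            NodeList matches = doc.getElementsByTagName(tagName);
            while (matches.getLength() > 0) {
                Node target = matches.item(0);
                Node parent = target.getParentNode();
                // insertBefore moves a node that is already in the tree,
                // so this drains the children in their original order.
                while (target.getFirstChild() != null) {
                    parent.insertBefore(target.getFirstChild(), target);
                }
                parent.removeChild(target);
            }

            // Serialize the modified tree back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div><p>some text <a href=\"#\">some <b>link</b> text</a></p></div>";
        System.out.println(unwrapTag(xml, "a"));
    }
}
```

In jsoup itself, unwrap() already does this in one call, so a hand-written loop like this is mainly useful for seeing what happens to the tree.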
