
Extract only HTML tags and attributes from an HTML string using Jsoup

I want to keep only the HTML tags along with their attributes and remove all of the text content.

Input string:

String html = "<p>An <br/><b></b> <a href='http://example.com/' target=\"h\"> <b> example <a><p></b>this is  the </a> link </p>";

Output:

<p><br></br><b></b><a href="http://example.com/" target="h"><b><a><p></p></a></b></a></p>

Edit: Most of the questions on Google or Stack Overflow are only about removing the HTML and extracting the text. It took me around 3 hours to arrive at the solutions below, so I am posting them here in case they help others.

Hope this helps someone like me who wants to remove only the text content from an HTML string.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.parser.Parser;

    String html = "<p>An <br/><b></b> <a href='http://example.com/' target=\"h\"> <b> example <a><p></b>this is  the </a> link </p>";
    Traverser traverser = new Traverser();

    // The XML parser keeps the fragment as it is. The HTML parser works as well,
    // but it adds the <html>, <head> and <body> tags to the result.
    Document document = Jsoup.parse(html, "", Parser.xmlParser());

    // Visit every node; the visitor collects only the tags and attributes.
    document.traverse(traverser);
    System.out.println(traverser.extractHtmlBuilder.toString());

Output:

    <p><br></br><b></b><a href="http://example.com/" target="h"><b><a><p></p></a></b></a></p>

Appending node.attributes() to the opening tag includes all of the element's attributes.

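    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.select.NodeVisitor;

    public static class Traverser implements NodeVisitor {

        // Accumulates the tags-only markup while the tree is traversed.
        StringBuilder extractHtmlBuilder = new StringBuilder();

        // Called when a node is first reached: write the opening tag.
        @Override
        public void head(Node node, int depth) {
            // Skip text nodes, comments and the Document root itself.
            if (node instanceof Element && !(node instanceof Document)) {
                extractHtmlBuilder.append("<").append(node.nodeName()).append(node.attributes()).append(">");
            }
        }

        // Called after all children were visited: write the closing tag.
        @Override
        public void tail(Node node, int depth) {
            if (node instanceof Element && !(node instanceof Document)) {
                extractHtmlBuilder.append("</").append(node.nodeName()).append(">");
            }
        }
    }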
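To see the note about node.attributes() in isolation, here is a minimal sketch that builds a single opening tag the same way the visitor does. The class name OpenTagDemo and the helper openTagKeeping are made up for illustration and are not part of the original answer; the second helper assumes you might want to keep just one attribute instead of all of them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class OpenTagDemo {

    // Same construction as in the visitor: Attributes.toString() renders
    // every attribute as key="value", each one preceded by a space.
    static String openTag(Element element) {
        return "<" + element.nodeName() + element.attributes() + ">";
    }

    // Hypothetical variant: keep only the attribute named by 'keep'.
    static String openTagKeeping(Element element, String keep) {
        StringBuilder sb = new StringBuilder("<").append(element.nodeName());
        for (Attribute attribute : element.attributes()) {
            if (attribute.getKey().equals(keep)) {
                // Attribute.html() renders a single key="value" pair.
                sb.append(' ').append(attribute.html());
            }
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse(
                "<a href='http://example.com/' target=\"h\">link</a>", "", Parser.xmlParser());
        Element link = document.selectFirst("a");
        System.out.println(openTag(link));                // <a href="http://example.com/" target="h">
        System.out.println(openTagKeeping(link, "href")); // <a href="http://example.com/">
    }
}
```

Iterating over the attributes yourself is only worth it when you need to filter or rename them; for the plain "keep everything" case, appending node.attributes() is shorter.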

Another solution:

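    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.parser.Parser;

    Document document = Jsoup.parse(html, "", Parser.xmlParser());
    for (Element element : document.select("*")) {
        // textNodes() returns a copy of the list, so removing while iterating is safe.
        // No ownText() check: it would skip elements that hold only whitespace text.
        for (TextNode node : element.textNodes()) {
            node.remove();
        }
    }
    System.out.println(document.toString());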
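A quick way to confirm that no text survived is to count the remaining TextNode instances with a NodeVisitor. The sketch below is only an illustration: the class TextNodeCheck and its helpers stripText and countTextNodes are hypothetical names, and stripText simply repeats the idea of the loop above so that the example is self-contained.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
import org.jsoup.select.NodeVisitor;

public class TextNodeCheck {

    // Hypothetical helper: parse as XML and remove the text nodes of every element.
    static Document stripText(String html) {
        Document document = Jsoup.parse(html, "", Parser.xmlParser());
        for (Element element : document.select("*")) {
            for (TextNode node : element.textNodes()) {
                node.remove();
            }
        }
        return document;
    }

    // Hypothetical helper: count the TextNode instances left in the tree.
    static int countTextNodes(Document document) {
        final int[] count = {0};
        document.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    count[0]++;
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return count[0];
    }

    public static void main(String[] args) {
        String html = "<p>An <b> example </b> link</p>";
        Document stripped = stripText(html);
        System.out.println("text nodes left: " + countTextNodes(stripped));
        System.out.println("<b> elements kept: " + stripped.select("b").size());
    }
}
```

The check reads the tree rather than the serialized string, so it does not depend on how a particular Jsoup version formats its output.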
