
Jsoup - Keep only the tags and remove all the text

I am trying to remove all the text between the tags of an HTML page using Jsoup.

For example, if the input HTML is

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

The output should be

<!DOCTYPE html>
<html>
<body>
<h1></h1>
<p></p>
</body>
</html>

Basically, I want to remove whatever doc.text() returns.

I have found a lot of posts that do the opposite and keep only the text, but nothing that solves my problem. Any idea how to do this?

EDIT

The solution proposed by maverick9999 (https://stackoverflow.com/a/24292349/3589481) solves most cases.

However, as noted in the comments, this solution also removes nested tags.

As an example:

    String str = "<!DOCTYPE html>" +
            "<html>" +
            "<body>" +
            "<div class='foo'>text <div class='THIS DIV WILL BE REMOVED'>text</div> text </div>" +
            "<h1>My First Heading</h1>\n" +
            "<p>My first paragraph.</p>\n" +
            "</body>\n" +
            "</html>";

    Document doc = Jsoup.parse(str);
    removeAllTexts(doc);
    System.out.println(doc);

    Elements all = doc.select("*");
    Iterator<Element> iterator = all.iterator();
    while (iterator.hasNext()) {
        Element e = iterator.next();
        if (!e.ownText().isEmpty()) {
            // text("") clears all existing contents, child elements included
            e.text("");
        }
    }

    System.out.println(doc);

This removes the nested div from the output:

    <html>
     <head></head>
     <body>
      <div class="foo">
      </div>
     </body>
    </html>

Any thoughts on how to avoid this?

EDIT 2

Jsoup treats the "meta" tag as self-closing (in HTML it is a void element, so it cannot have children). So if you have something like this:

    System.out.println("\n\n----");
    String html = "<!DOCTYPE html>\r\n"
            + "<html>\r\n"
            + "<head>\n"
            + "<meta content=\"/myimage.png\" itemprop=\"image\">\n"
            + "<title>Title</title>\n"
            + "<script>Random Javascript here</script>"
            + "</meta>"
            + "</head>"
            + "<body>\r\n"
            + "<h1>My First <i>Heading</i></h1>\r\n"
            + "<hr/>\r\n"
            + "<p>My first paragraph.</p>\r\n"
            + "<p> <div class='foo'>text <div class='bar'> text </div> text </div> </p>\r\n"
            + "</body>\r\n"
            + "</html>";

    Document doc2 = Jsoup.parse(html, "", Parser.xmlParser());
    printNodes(doc2);

Then none of the tags after meta are read. With Pshemo's solution, scripts are removed, and if you have, for example, br tags with children, they are removed as well. I finally ended up with the following solution (thanks to Pshemo for his help):

    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Node;
    import org.jsoup.parser.Tag;

    // Prints the tag structure below the given node and skips every text node.
    public static void printNodes(Node node) {
        String name = node.nodeName();
        if (name.equals("#doctype")) {
            System.out.println(node);
        } else if (name.equals("#text")) {
            return; // text nodes are dropped
        } else if (name.equals("#document")) {
            for (Node n : node.childNodes())
                printNodes(n);
        }
        // There is no reason to have text here, so print everything
        else if (name.equals("head") || name.equals("script")) {
            System.out.println(node.toString());
        } else {
            // Use the short form only for self-closing tags that have no children
            if (!Tag.valueOf(name).isSelfClosing() || node.childNodeSize() > 0) {
                System.out.println("<" + name + getAttributes(node) + ">");
                for (Node n : node.childNodes())
                    printNodes(n);
                System.out.println("</" + name + ">");
            } else {
                // System.out.println("debug: " + name + " is self closing");
                System.out.println("<" + name + getAttributes(node) + "/>");
            }
        }
    }

    // Builds the attribute list of a node as ` key="value"` pairs (values are not escaped).
    public static String getAttributes(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Attribute attr : node.attributes()) {
            sb.append(" ").append(attr.getKey()).append("=\"")
                    .append(attr.getValue()).append("\"");
        }
        return sb.toString();
    }
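
The walk-and-print idea (visit each node, emit tags and attributes, skip text) can also be sketched without Jsoup. The version below is only an illustration under several assumptions: it uses the JDK's built-in W3C DOM parser rather than Jsoup, so it accepts well-formed XML only; it always writes an explicit closing tag instead of a self-closing form; it does not escape attribute values; and the names TagSkeleton, appendTags and skeleton are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only: JDK W3C DOM, not Jsoup. Names are hypothetical.
class TagSkeleton {

    // Append the tags below 'node' to 'out'; anything that is not an element is skipped.
    static void appendTags(Node node, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // text, comments, etc. are dropped
        }
        out.append('<').append(node.getNodeName());
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            // Attribute values are written as-is (no escaping), as in getAttributes above
            out.append(' ').append(a.getNodeName())
               .append("=\"").append(a.getNodeValue()).append('"');
        }
        out.append('>');
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            appendTags(c, out);
        }
        out.append("</").append(node.getNodeName()).append('>');
    }

    // Parse well-formed XML and return its tag structure without any text.
    static String skeleton(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            appendTags(root, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("input must be well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><h1>My First <i>Heading</i></h1><hr/>"
                + "<p class=\"x\">My first paragraph.</p></body>";
        // prints <body><h1><i></i></h1><hr></hr><p class="x"></p></body>
        System.out.println(skeleton(xml));
    }
}
```

Because this parser rejects markup that is not well-formed, the Jsoup version above remains the better fit for real-world HTML.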

The code below should solve your problem with nested tags:

Updated code:

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    for (Element el : doc.select("*")) {
        if (!el.ownText().isEmpty()) {
            for (TextNode node : el.textNodes())
                node.remove();
        }
    }

    System.out.println(doc);
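
A note on why removing the TextNode objects keeps nested tags while e.text("") does not: Jsoup documents Element.text(String) as clearing any existing contents (text or elements) before setting the new text. The sketch below reproduces that contrast with the JDK's built-in W3C DOM API instead of Jsoup, so it runs without extra dependencies. Here setTextContent("") stands in for text(""), the input is assumed to be well-formed XML, and the names TextContentPitfall, afterSetTextContent and afterRemovingTextNodes are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only: JDK W3C DOM, not Jsoup. Names are hypothetical.
class TextContentPitfall {

    // Parse a well-formed XML fragment and return its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("input must be well-formed XML", e);
        }
    }

    // Count the direct child elements of a node, ignoring text nodes.
    static int childElements(Node parent) {
        int count = 0;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    // Blank the root in one call: every child is dropped, elements included.
    static int afterSetTextContent(String xml) {
        Element root = parseRoot(xml);
        root.setTextContent("");
        return childElements(root);
    }

    // Remove only the direct text children: nested elements survive.
    static int afterRemovingTextNodes(String xml) {
        Element root = parseRoot(xml);
        Node child = root.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before the node is detached
            if (child.getNodeType() == Node.TEXT_NODE) {
                root.removeChild(child);
            }
            child = next;
        }
        return childElements(root);
    }

    public static void main(String[] args) {
        String xml = "<div class='foo'>text <div class='bar'>text</div> text </div>";
        System.out.println(afterSetTextContent(xml));    // prints 0: nested div is gone
        System.out.println(afterRemovingTextNodes(xml)); // prints 1: nested div is kept
    }
}
```

The Jsoup loop above follows the second pattern: it detaches each text node individually and leaves the element children of el untouched.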
