
Java Jsoup : Extract all the text

I have the following code. The doc.body().text() statement doesn't output the text inside the style and script tags. I read the source of the .text() function, and it looks for all instances of TextNode. What is a TextNode in jsoup?

And why is the script text not included in the .text() output?

    String contex = "<html><body><style>style</style><div>div</div><script>script</script><p>paragraph</p>body</body></html>";
    // The second argument of Jsoup.parse(String, String) is a base URI, not a charset,
    // so only the HTML string is passed here.
    Document doc = Jsoup.parse(contex);
    String text = doc.body().text();
    System.out.println("Test text : " + text);

Output : paragraphbody

To read the contents of tags like <script>, select those elements (getElementsByTag returns an org.jsoup.select.Elements) and read their data nodes instead of calling .text().

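    import org.jsoup.Jsoup;
    import org.jsoup.nodes.DataNode;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    String contex = "<html><body><style>style</style><div>div</div><script>scripts</script><p>paragraph</p><p>body</p><script>787878</script></body></html>";
    // The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted.
    Document doc = Jsoup.parse(contex);
    Elements scriptElements = doc.getElementsByTag("script");

    // Script contents are stored as DataNode children, so read them through dataNodes().
    for (Element el : scriptElements) {
        for (DataNode dn : el.dataNodes()) {
            System.out.println(dn.getWholeData());
        }
    }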

Output:

scripts
787878
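As a shorter variant of the loop above, Element.data() returns the combined data of an element, so the inner loop over dataNodes() can be skipped. The following is only a sketch: the class name ScriptDataDemo and the helper scriptData are made up for illustration, and the "script, style" selector is an assumption about which tags you want to read.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ScriptDataDemo {
    // Hypothetical helper: returns one "tag: data" entry per <script> or <style> element,
    // in document order.
    public static List<String> scriptData(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element el : doc.select("script, style")) {
            // data() concatenates the element's DataNode contents
            result.add(el.tagName() + ": " + el.data());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><style>style</style><div>div</div><script>scripts</script>"
                + "<p>paragraph</p><p>body</p><script>787878</script></body></html>";
        scriptData(html).forEach(System.out::println);
    }
}
```

For the sample HTML this should list the style element first, followed by the two script elements.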

And why is the script text not included in the .text() output?

Because jsoup stores the contents of script and style elements as data (DataNode), not as text (TextNode), and .text() only collects text nodes.
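One way to see the difference for yourself is to print the class of each child node. This is a minimal sketch, not part of jsoup: NodeTypeDemo and describe are hypothetical names, and the HTML is the string from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

import java.util.ArrayList;
import java.util.List;

public class NodeTypeDemo {
    // Hypothetical helper: for every direct child element of <body>, records
    // "tagName -> NodeClass" for each of that element's child nodes.
    public static List<String> describe(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element el : doc.body().children()) {
            for (Node child : el.childNodes()) {
                lines.add(el.tagName() + " -> " + child.getClass().getSimpleName());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<html><body><style>style</style><div>div</div>"
                + "<script>script</script><p>paragraph</p>body</body></html>";
        describe(html).forEach(System.out::println);
    }
}
```

The style and script entries should report DataNode, while div and p should report TextNode, which is why .text() skips the former.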

To read a script's data, first select the script elements with getElementsByTag:

    Elements scriptElements = doc.getElementsByTag("script");

then read each data node with getWholeData:

    for (Element element : scriptElements) {
        for (DataNode node : element.dataNodes()) {
            System.out.println(node.getWholeData());
        }
        System.out.println("-------------------");
    }

According to the jsoup source code, the characters inside a style or script tag are inserted as a DataNode instead of a TextNode:

    void insert(Token.Character characterToken) {
        Node node;
        // characters in script and style go in as datanodes, not text nodes
        final String tagName = currentElement().tagName();
        final String data = characterToken.getData();

        if (characterToken.isCData())
            node = new CDataNode(data);
        else if (tagName.equals("script") || tagName.equals("style"))
            node = new DataNode(data);
        else
            node = new TextNode(data);
        currentElement().appendChild(node); // doesn't use insertNode, because we don't foster these; and will always have a stack.
    }
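Given that split between node types, one way to extract all the text, including script and style contents, is to walk the tree yourself and collect both kinds of node. The sketch below assumes a single space between pieces is acceptable; AllTextDemo, collect and allText are hypothetical names, not jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class AllTextDemo {
    // Hypothetical helper: depth-first walk that appends TextNode text and
    // DataNode data in document order, separated by single spaces.
    static void collect(Node node, StringBuilder out) {
        if (node instanceof TextNode) {
            out.append(((TextNode) node).text()).append(' ');
        } else if (node instanceof DataNode) {
            out.append(((DataNode) node).getWholeData()).append(' ');
        }
        for (Node child : node.childNodes()) {
            collect(child, out);
        }
    }

    // Returns text plus script/style data for everything under <body>.
    public static String allText(String html) {
        StringBuilder out = new StringBuilder();
        collect(Jsoup.parse(html).body(), out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><style>style</style><div>div</div>"
                + "<script>script</script><p>paragraph</p>body</body></html>";
        System.out.println(allText(html));
    }
}
```

A plain recursive walk is used here instead of a NodeVisitor so the sketch does not depend on which visitor methods a given jsoup version requires you to implement.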

