How to get orphaned text with Jsoup?

I have the following HTML:

<span>This is the first text</span>
More text here 
Another line of text
<span>Text in the span</span>
<span>Another text in span</span>
This is another line

I want to get all the text in document order, as something like this array:

[
"Span:This is the first text",
"More text here",
"Another line of text",
"Span:Text in the span",
"Span:Another text in span",
"This is another line",
]

I would use a recursive method that takes your starting tag and iterates over its child nodes. For each TextNode, print its contents; for each Element, recurse into its child nodes.

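import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

// Both methods go inside a class of your choice.
public static void main(String[] args) throws IOException
{
    // I put your HTML in the body tag of a local file
    Document doc = Jsoup.parse(new File("input/20160505.html"), "UTF-8");
    Elements elements = doc.getElementsByTag("body");
    Element rootTag = elements.get(0);
    printTextOfTag(rootTag);
}

// Walks the child nodes in document order: prints text nodes, recurses into elements
public static void printTextOfTag(Element currentTag)
{
    List<Node> nodes = currentTag.childNodes();
    for(Node n : nodes)
    {
        if(n instanceof TextNode)
        {
            // text() returns the node's text with whitespace normalized
            System.out.println(((TextNode)n).text());
        }
        else if(n instanceof Element)
        {
            printTextOfTag((Element)n);
        }
    }
}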

Output

This is the first text

 More text here Another line of text 

Text in the span



Another text in span

 This is another line
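
The output above puts "More text here" and "Another line of text" on one line and carries no "Span:" label, so it does not quite match the array in the question. The following is a minimal sketch of one way to get closer, not a definitive solution. It assumes that orphaned text should be split on the line breaks of the source HTML, and that only leaf elements (those with no child elements) get a label built from their tag name. The class name OrphanTextCollector and the method collect are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class; the name is not part of Jsoup.
public class OrphanTextCollector {

    // Parses the HTML and returns its text pieces in document order.
    public static List<String> collect(String html) {
        List<String> out = new ArrayList<>();
        walk(Jsoup.parse(html).body(), out);
        return out;
    }

    private static void walk(Element parent, List<String> out) {
        for (Node n : parent.childNodes()) {
            if (n instanceof TextNode) {
                // getWholeText() keeps the original line breaks, unlike text(),
                // so each source line can become its own entry (assumption).
                for (String line : ((TextNode) n).getWholeText().split("\\R")) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        out.add(trimmed);
                    }
                }
            } else if (n instanceof Element) {
                Element e = (Element) n;
                if (!e.children().isEmpty()) {
                    // Element with nested elements: keep walking.
                    walk(e, out);
                } else if (!e.text().isEmpty()) {
                    // Leaf element: prefix its text with the capitalized tag name.
                    String tag = e.tagName();
                    String label = Character.toUpperCase(tag.charAt(0)) + tag.substring(1);
                    out.add(label + ":" + e.text());
                }
            }
        }
    }

    public static void main(String[] args) {
        String html = "<span>This is the first text</span>\n"
                + "More text here \n"
                + "Another line of text\n"
                + "<span>Text in the span</span>\n"
                + "<span>Another text in span</span>\n"
                + "This is another line";
        // Prints one entry per line, in document order.
        collect(html).forEach(System.out::println);
    }
}
```

Reading the text node with getWholeText() instead of text() is the key choice here: text() collapses line breaks into single spaces, which is why the two orphaned lines are merged in the output above.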
