How to get orphaned text with Jsoup?

I have this HTML:

<span>This is the first text</span>
More text here 
Another line of text
<span>Text in the span</span>
<span>Another text in span</span>
This is another line

I want to get all the text in document order, as something like this array:

[
"Span:This is the first text",
"More text here",
"Another line of text",
"Span:Text in the span",
"Span:Another text in span",
"This is another line",
]

I would go with a recursive method that takes your starting tag and iterates over its child nodes. For each TextNode, print its contents. For each Element, recurse into its child nodes.

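import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class PrintText
{
    public static void main(String[] args) throws IOException
    {
        // I put your HTML in the body tag in a local file
        Document doc = Jsoup.parse(new File("input/20160505.html"), "UTF-8");
        Element rootTag = doc.body();
        printTextOfTag(rootTag);
    }

    // Prints every text node under currentTag, in document order
    public static void printTextOfTag(Element currentTag)
    {
        List<Node> nodes = currentTag.childNodes();
        for (Node n : nodes)
        {
            if (n instanceof TextNode)
            {
                // Loose text sitting directly inside the current tag
                System.out.println(((TextNode) n).text());
            }
            else if (n instanceof Element)
            {
                // Nested tag: descend to reach its text nodes
                printTextOfTag((Element) n);
            }
        }
    }
}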

Output:

This is the first text

 More text here Another line of text 

Text in the span



Another text in span

 This is another line
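
Note that the output above merges the two loose lines into one and drops the "Span:" labels, so it does not quite match the array in the question. Below is a minimal sketch of one way to get closer. It assumes that each source line of loose text should become its own entry and that only direct span elements need the prefix. The names OrphanText, collect and collectTexts are hypothetical, chosen just for this illustration. It relies on TextNode.getWholeText(), which keeps the original line breaks, whereas text() normalizes whitespace.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class OrphanText {

    // Hypothetical helper: walks the children of root in document order
    // and appends one entry per span or per non-blank line of loose text.
    static void collect(Element root, List<String> out) {
        for (Node n : root.childNodes()) {
            if (n instanceof TextNode) {
                // getWholeText() preserves line breaks, so split on them
                String whole = ((TextNode) n).getWholeText();
                for (String line : whole.split("\\R")) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        out.add(trimmed);
                    }
                }
            } else if (n instanceof Element) {
                Element e = (Element) n;
                if (e.tagName().equals("span")) {
                    // Assumption: span text is reported with a "Span:" prefix
                    out.add("Span:" + e.text().trim());
                } else {
                    // Any other tag: keep descending
                    collect(e, out);
                }
            }
        }
    }

    // Parses an HTML fragment and returns its texts in document order.
    static List<String> collectTexts(String html) {
        List<String> out = new ArrayList<>();
        collect(Jsoup.parse(html).body(), out);
        return out;
    }

    public static void main(String[] args) {
        String html = "<span>This is the first text</span>\n"
                + "More text here \n"
                + "Another line of text\n"
                + "<span>Text in the span</span>\n"
                + "<span>Another text in span</span>\n"
                + "This is another line\n";
        collectTexts(html).forEach(System.out::println);
    }
}
```

With the HTML from the question, this should yield the six entries of the desired array, one per line. Whitespace-only text nodes between the spans are skipped because they trim to empty strings.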
