
Parsing nested HTML unordered lists with Jsoup

I am parsing an HTML file that contains nested unordered lists. Here is an example:

<ul>
    <li class="category_x">xyz abc
        <ul>
            <li>foo 123 bar</li>
            <li>456 bar foo</li>
        </ul>
    </li>
    <li class="category_x">aaa bbb ccc
        <ul>
            <li>xxx yyy zzz</li>
            <li>123 abc 456</li>
        </ul>
    </li>
</ul>

I am interested in the relationship li > ul > li (think of it as Jsoup objects of type Element: grandParentNode > parentNode > eNode), but grandParentNode.text() also returns the text of the whole nested <ul> list (including eNode.text()).

    // getting the triplets
    Elements triplets = doc.select("li > ul > li");

    // print the triplet
    for (Element eNode : triplets)
    {
        Element parentNode = eNode.parent();
        Element grandParentNode = parentNode.parent();

        System.out.println("Current node: " + eNode.text());
        System.out.println("Grand parent: " + grandParentNode.text());
    }

The output is:

Current node: foo 123 bar
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: 456 bar foo
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456
Current node: 123 abc 456
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456

I would like it to be:

Current node: foo 123 bar
Grand parent: xyz abc
Current node: 456 bar foo
Grand parent: xyz abc
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc
Current node: 123 abc 456
Grand parent: aaa bbb ccc

Looking at the Jsoup documentation, it seems I would need to modify the HTML so that those strings sit in something like a value="" attribute, but I cannot modify the HTML. On top of this, the <li class="category_x"> elements are repeated everywhere, with the same class value on every node that is not a leaf li of the tree, so the class is not really helpful for filtering the data.

I have already tried things like doc.select("li:lt(1) > ul > li"); but that does not work. The problem is the structure of the HTML and how I am using the text() method of Jsoup's Element class, and I have no idea how to avoid text().

Any ideas?

Thanks

Use the ownText() method to get only the text owned directly by an Element, ignoring the text of any child elements.

So change this line:

System.out.println("Grand parent: " + grandParentNode.text());

to

System.out.println("Grand parent: " + grandParentNode.ownText());

The output will now show:

Current node: foo 123 bar
Grand parent: xyz abc
Current node: 456 bar foo
Grand parent: xyz abc
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc
Current node: 123 abc 456
Grand parent: aaa bbb ccc
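
For reference, here is a minimal, self-contained sketch of the whole thing with that one-line change applied. It assumes the Jsoup library is on the classpath; the class name NestedListOwnText, the HTML constant, and the describe helper are made up for this illustration and are not part of Jsoup. The markup from the question is inlined as a string so the example can run without an input file.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this illustration.
class NestedListOwnText {

    // Same markup as in the question, inlined so the sketch is self-contained.
    static final String HTML =
        "<ul>"
        + "<li class=\"category_x\">xyz abc"
        + "<ul><li>foo 123 bar</li><li>456 bar foo</li></ul>"
        + "</li>"
        + "<li class=\"category_x\">aaa bbb ccc"
        + "<ul><li>xxx yyy zzz</li><li>123 abc 456</li></ul>"
        + "</li>"
        + "</ul>";

    // Hypothetical helper: returns the lines that the loop in the question prints.
    static List<String> describe(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();

        // Every li that sits inside a ul that is itself a direct child of an li.
        for (Element eNode : doc.select("li > ul > li")) {
            Element parentNode = eNode.parent();            // the nested <ul>
            Element grandParentNode = parentNode.parent();  // the outer <li>

            // text() on a leaf li is just its own text.
            lines.add("Current node: " + eNode.text());
            // ownText() skips the text of the nested <ul> and its children.
            lines.add("Grand parent: " + grandParentNode.ownText());
        }
        return lines;
    }

    public static void main(String[] args) {
        describe(HTML).forEach(System.out::println);
    }
}
```

Collecting the lines in a list rather than printing inside the loop is only a convenience for checking the result. The selector and the parent().parent() navigation are unchanged from the question.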
