Parsing HTML with Jsoup

Question

I'm trying to do some parsing and I'm stuck. Here's the structure of the HTML:

<ul class="sub-menu"> 
<li id="1" class="1"><a href="http://link">SOME TEXT</a> 
    <ul class="sub-menu"> 
        <li .... ><a ... /></li>
        <li .... ><a ... /></li>
        <li .... ><a ... /></li>
    </ul>
</li>
<li id="2" class="2"><a href="http://link2">SOME OTHER TEXT</a> 
    <ul class="sub-menu"> 
        <li .... ><a ... /></li>
        <li .... ><a ... /></li>
        <li .... ><a ... /></li>
    </ul>
</li></ul>

I need to get each top-level li (id = 1, 2 and so on) and then the li elements nested inside each of them ( <li .... ><a ... /></li> ).

Here's how my Java looks:

// ul contains the source above
Elements lis = ul.select("li"); // I know that this line screws up everything here, but I can't figure out how to do it correctly
for(Element li: lis)
{
    String text = li.select("a").first().text();
    Elements lis2 = li.select("ul[class=sub-menu]").first().getElementsByTag("li");     
    for(Element li2: lis2)
    {
        Element a = li2.select("a").first();
        // and other stuff with 'a'
    }
}

So can anybody help me solve this problem?

EDIT: The problem is that ul.select("li"); returns every single 'li' in the source I posted here. I need to get only the li elements with id 1, 2 and so on, and then, for each of them, the nested <li .... ><a ... /></li> elements. P.S. Sorry for my bad English.

Answer 1

I'm not sure, but try something like this:

for (Element element : doc.select("li"))
{
    // attr() returns a String, so compare with equals() rather than ==
    String id = element.attr("id");
    if ("1".equals(id) || "2".equals(id))
    {
        // 'element' is one of the li elements you want
        System.out.println(element);
    }
}

Regards, Hugo Pedrosa

Answer 2

Have you tried

`ul.children()`

I think it will return only the immediate child elements of ul.
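
A minimal sketch of the children() approach, assuming markup modelled on the question with the elided inner items filled in by made-up placeholder links (http://a1 and so on). The class name ChildrenExample and the helper describeMenu are hypothetical. Element.children() returns only direct child elements, so the outer loop sees just the top-level li elements, and the inner loop walks the direct children of each nested ul:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ChildrenExample {
    // Placeholder markup modelled on the question; the inner links are invented.
    static final String HTML =
        "<ul class=\"sub-menu\">"
      + "<li id=\"1\" class=\"1\"><a href=\"http://link\">SOME TEXT</a>"
      + "<ul class=\"sub-menu\">"
      + "<li><a href=\"http://a1\">A1</a></li>"
      + "<li><a href=\"http://a2\">A2</a></li>"
      + "</ul></li>"
      + "<li id=\"2\" class=\"2\"><a href=\"http://link2\">SOME OTHER TEXT</a>"
      + "<ul class=\"sub-menu\">"
      + "<li><a href=\"http://b1\">B1</a></li>"
      + "</ul></li>"
      + "</ul>";

    // Returns one entry per nested link, formatted as "parent text -> child text".
    static List<String> describeMenu(String html) {
        Document doc = Jsoup.parse(html);
        // The outermost ul is the first match in document order.
        Element ul = doc.select("ul.sub-menu").first();
        List<String> lines = new ArrayList<>();
        for (Element li : ul.children()) {            // direct children only
            String text = li.select("a").first().text();
            Element subMenu = li.select("ul.sub-menu").first();
            if (subMenu == null) {
                continue;                             // this li has no nested menu
            }
            for (Element li2 : subMenu.children()) {  // nested li elements
                Element a = li2.select("a").first();
                lines.add(text + " -> " + a.text());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeMenu(HTML)) {
            System.out.println(line);
        }
    }
}
```

The null check is there because a top-level li without a nested ul would otherwise throw on first().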

Answer 3

Use the index-based pseudo selectors built into Jsoup, such as :lt(n) and :gt(n).

You can select elements by including a pseudo selector that looks at an element's position in the DOM relative to its parent:

Elements lis = ul.select("li:lt(2)");

which should return only the li elements at index 0 and 1.

Please refer to the Jsoup documentation for pseudo selectors, which explains this better than I can!

http://jsoup.org/cookbook/extracting-data/selector-syntax
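
One caveat, as far as I understand the selector: li:lt(2) is checked against every li below ul, so nested li elements whose index within their own parent is 0 or 1 can match too. If the goal is only the top-level items, a child combinator may be a simpler fit. The sketch below assumes markup modelled on the question (the inner links are invented). The class name DirectChildSelector and its helper methods are hypothetical. Jsoup accepts a selector that starts with a combinator and resolves it relative to the element select() is called on:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class DirectChildSelector {
    // Placeholder markup modelled on the question; the inner links are invented.
    static final String HTML =
        "<ul class=\"sub-menu\">"
      + "<li id=\"1\" class=\"1\"><a href=\"http://link\">SOME TEXT</a>"
      + "<ul class=\"sub-menu\">"
      + "<li><a href=\"http://a1\">A1</a></li>"
      + "<li><a href=\"http://a2\">A2</a></li>"
      + "<li><a href=\"http://a3\">A3</a></li>"
      + "</ul></li>"
      + "<li id=\"2\" class=\"2\"><a href=\"http://link2\">SOME OTHER TEXT</a>"
      + "<ul class=\"sub-menu\">"
      + "<li><a href=\"http://b1\">B1</a></li>"
      + "</ul></li>"
      + "</ul>";

    // Returns the outermost ul.sub-menu, which is the first match in document order.
    static Element outerList(String html) {
        return Jsoup.parse(html).select("ul.sub-menu").first();
    }

    // Collects the id of each li that is a direct child of the given ul.
    static List<String> topLevelIds(Element ul) {
        Elements topLevel = ul.select("> li");
        List<String> ids = new ArrayList<>();
        for (Element li : topLevel) {
            ids.add(li.attr("id"));
        }
        return ids;
    }

    // Collects the link text of the li elements nested directly under one top-level li.
    static List<String> nestedLinkTexts(Element li) {
        List<String> texts = new ArrayList<>();
        for (Element a : li.select("> ul.sub-menu > li > a")) {
            texts.add(a.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        Element ul = outerList(HTML);
        System.out.println(topLevelIds(ul));
        for (Element li : ul.select("> li")) {
            System.out.println(li.attr("id") + ": " + nestedLinkTexts(li));
        }
    }
}
```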

3 answers

Answer 1 0 2013-05-14 15:50:52

Answer 2 0 2013-05-14 19:43:33

Answer 3 0 2013-05-15 06:51:47
