Parsing an HTML table without an ID with JSoup in Java

I am trying to process a large amount of data for a research project. I have a number of HTML files on my computer and I need to read some information from them into a Java program.

I use Jsoup to load the document.

Unfortunately, the table in the HTML has no class or id (and there are multiple tables). I have searched Stack Overflow, but all the answers I find use table.class.

How could I get the data (18/01/2014) from the table below? doc.select is not working now, I think because of the missing class.

I am trying something like this:

    Element table = doc.select("table").first();
    Iterator<Element> ite = table.select("td").iterator();

    ite.next(); // skip the first cell

    System.out.println("Value 1: " + ite.next().text());
    System.out.println("Value 2: " + ite.next().text());
    System.out.println("Value 3: " + ite.next().text());
    System.out.println("Value 4: " + ite.next().text());




<table border=0 cellpadding=0 cellspacing=0 width=650 height=18><tr><td class="header" style="color:#FFFFFF;"><table border=0 cellpadding=0 cellspacing=0><tr>
<td><img src="/images/title_ultratop.png"></td><td style="color:#FFFFFF;vertical-align:middle;"><b>50 DANCE<br> 
<a href="link"><img src="/images/arr_bw.png" border=0 style="margin-bottom:1px;margin-right:3px;"></a>18/01/2014
</b></td></tr></table>

-- EDIT --

I found that the table was inside another table. Using this code I can get it, but I only get one line now: just the table. I still need to get one element out of it.

    Element table = doc.select("table table").first();

    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        System.out.println(tds.get(0).text());
    }

I guess I am displaying the entire table now. How do I get, say, the 2nd element?

There are some problems in your HTML. I suppose the correct version is:

<table border="1" cellpadding="0" cellspacing="0" width="650" height="18">
    <tr>
        <td class="header" style="color:#FFFFFF;">
            <table border="1" cellpadding="0" cellspacing="0">
                <tr>
                    <td><img src="/images/title_ultratop.png"></td>
                    <td style="color:#FFFFFF;vertical-align:middle;">
                        <b>50 DANCE
                        <br>
                        <a href="link"><img src="/images/arr_bw.png" border="0"
                                            style="margin-bottom:1px;margin-right:3px;"></a>
                        18/01/2014
                        </b>
                    </td>
                </tr>
            </table>
        </td>
    </tr>
</table>

In order to get that node you have to select table table td b and then take the child node at index 4 (a text node; indices start at 0, so this is the fifth child, counting the whitespace text node after the br):

    Elements td = doc.select("table table td b");
    TextNode el = (TextNode)td.first().childNode(4);
    System.out.println(el.text());
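
Indexing with childNode(4) depends on the exact whitespace and element order inside the b tag. As a hedged alternative, the sketch below takes the last non-blank text node directly under the b element instead. The class name NestedTableDate and the helper lastOwnText are made up for illustration, and the inline HTML is a trimmed copy of the markup from the question:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.List;

public class NestedTableDate {

    // Hypothetical helper: returns the last non-blank text node directly under
    // the first <b> inside a nested table, or null if nothing matches.
    static String lastOwnText(Document doc) {
        Element bold = doc.select("table table td b").first();
        if (bold == null) {
            return null;
        }
        // textNodes() lists only the direct text children of the element.
        List<TextNode> texts = bold.textNodes();
        for (int i = texts.size() - 1; i >= 0; i--) {
            if (!texts.get(i).isBlank()) {
                return texts.get(i).text().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Trimmed copy of the markup from the question.
        String html = "<table><tr><td class=\"header\"><table><tr>"
                + "<td><img src=\"/images/title_ultratop.png\"></td>"
                + "<td><b>50 DANCE<br> "
                + "<a href=\"link\"><img src=\"/images/arr_bw.png\"></a>18/01/2014\n"
                + "</b></td></tr></table></td></tr></table>";
        Document doc = Jsoup.parse(html);
        System.out.println(lastOwnText(doc));
    }
}
```

This assumes the date is always the last piece of text inside the b tag; if the pages vary, the index-based approach or a regular expression on the text may be a better fit.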

Right, a third embedded table, and it works.

    Element table = doc.select("table table").first();

I still need to select a different table on the site as well. I read about table:contains(word). Hope that will work!
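
For the table:contains(word) idea, here is a minimal sketch of how it might be used with nested tables. Jsoup's :contains pseudo-selector matches on an element's text including its descendants, so every enclosing table matches as well. Taking the last match is one way to reach a table that has no matching table nested inside it. The class name TableByText and the helper innermostTableContaining are hypothetical, and the word is assumed to be plain text without parentheses or other selector syntax:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableByText {

    // Hypothetical helper: returns a <table> whose text contains `word`
    // (case-insensitive) and that has no matching table nested inside it,
    // or null if no table matches.
    static Element innermostTableContaining(Document doc, String word) {
        // select() returns matches in document order, so enclosing tables
        // come before the tables nested inside them.
        return doc.select("table:contains(" + word + ")").last();
    }

    public static void main(String[] args) {
        String html = "<table id=\"outer\"><tr><td>"
                + "<table id=\"inner\"><tr><td>50 DANCE</td><td>18/01/2014</td></tr></table>"
                + "</td></tr></table>"
                + "<table id=\"other\"><tr><td>something else</td></tr></table>";
        Document doc = Jsoup.parse(html);

        Element table = innermostTableContaining(doc, "DANCE");
        if (table != null) {
            // Second cell of the matched table (index 1).
            System.out.println(table.select("td").get(1).text());
        }
    }
}
```

If several separate tables contain the word, this returns the last one in document order, so a more specific word or an additional check on the result may be needed.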
