
Parsing an HTML table without an ID with Jsoup in Java

I am trying to process a large amount of data for a research project. I have a number of HTML files on my computer and need to read some information from them into a Java program.

I use Jsoup to load the document.

Unfortunately, the table in the HTML has no class or id (and there are multiple tables). I have searched Stack Overflow, but all the answers I find select the table by its class.

How can I get the data (18/01/2014) from the table below? My doc.select call is not working, I think because of the missing class.

I am trying something like this:

    Element table = doc.select("table").first();

    Iterator<Element> ite = table.select("td").iterator();

    ite.next();

    System.out.println("Value 1: " + ite.next().text());
    System.out.println("Value 2: " + ite.next().text());
    System.out.println("Value 3: " + ite.next().text());
    System.out.println("Value 4: " + ite.next().text());




    <table border=0 cellpadding=0 cellspacing=0 width=650 height=18><tr><td class="header" style="color:#FFFFFF;"><table border=0 cellpadding=0 cellspacing=0><tr>
    <td><img src="/images/title_ultratop.png"></td><td style="color:#FFFFFF;vertical-align:middle;"><b>50 DANCE<br> 
    <a href="link"><img src="/images/arr_bw.png" border=0 style="margin-bottom:1px;margin-right:3px;"></a>18/01/2014
    </b></td></tr></table>

-- EDIT

I found that the table is nested inside another table. With the code below I can reach it, but I only get one line of output now: just the table. I still need to get a single element out of it.

    Element table = doc.select("table table").first();

    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        System.out.println(tds.get(0).text());
    }

I guess I am printing the entire table now. How do I get, say, the second element?

There are some problems in your HTML. I suppose the correct version is:

    <table border="1" cellpadding="0" cellspacing="0" width="650" height="18">
        <tr>
            <td class="header" style="color:#FFFFFF;">
                <table border="1" cellpadding="0" cellspacing="0">
                    <tr>
                        <td><img src="/images/title_ultratop.png"></td>
                        <td style="color:#FFFFFF;vertical-align:middle;">
                            <b>50 DANCE
                            <br>
                            <a href="link"><img src="/images/arr_bw.png" border="0"
                                                style="margin-bottom:1px;margin-right:3px;"></a>
                            18/01/2014
                            </b>
                        </td>
                    </tr>
                </table>
            </td>
        </tr>
    </table>

To get that node, select table table td b and then take the child node at index 4, which is a text node (indices are zero-based, so this is the fifth child node of the b element):

    Elements td = doc.select("table table td b");
    TextNode el = (TextNode)td.first().childNode(4);
    System.out.println(el.text());
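
The fixed index depends on the exact whitespace and tags inside the b element, so it can shift if the markup changes slightly. A minimal sketch of a less brittle variant, assuming jsoup is on the classpath and that the value you want is always the last non-blank text directly inside the b element (the class and method names are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.List;

public class LastTextNodeExample {

    // Returns the last non-blank text directly inside the first
    // "table table td b" element, or null if nothing matches.
    static String lastOwnText(Document doc) {
        Element bold = doc.select("table table td b").first();
        if (bold == null) {
            return null;
        }
        // textNodes() only returns direct text children, in document order.
        List<TextNode> texts = bold.textNodes();
        for (int i = texts.size() - 1; i >= 0; i--) {
            TextNode node = texts.get(i);
            if (!node.isBlank()) {
                return node.text().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Reduced copy of the markup from the question.
        String html = "<table><tr><td class=\"header\"><table><tr>"
                + "<td><img src=\"/images/title_ultratop.png\"></td>"
                + "<td><b>50 DANCE<br> "
                + "<a href=\"link\"><img src=\"/images/arr_bw.png\"></a>18/01/2014\n"
                + "</b></td></tr></table></td></tr></table>";
        Document doc = Jsoup.parse(html);
        System.out.println(lastOwnText(doc));
    }
}
```

Walking the text nodes from the end skips the whitespace-only nodes that the parser keeps between tags, which is what makes a hard-coded index fragile.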

Right, a third embedded table, and it works.

    Element table = doc.select("table table").first();

I still need to select a different table on the site as well. I read about table:contains(word). Hope that will work!
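
For what it's worth, jsoup's :contains(text) pseudo-selector matches case-insensitively on an element's text, including the text of its descendants. With nested tables that means every enclosing table matches too. A rough sketch, assuming jsoup is on the classpath and that the phrase occurs in only one chain of nested tables, so the last match in document order is the innermost one (the markup, class name, and method name below are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableByTextExample {

    // Finds the innermost table whose text contains the given phrase.
    // ":contains" also matches every enclosing table, so the last match
    // in document order is taken. Returns null if nothing matches.
    // Assumption: the phrase contains no parentheses, which would need escaping.
    static Element innermostTableContaining(Document doc, String phrase) {
        return doc.select("table:contains(" + phrase + ")").last();
    }

    public static void main(String[] args) {
        // Made-up markup: one nested table like the one in the question,
        // followed by an unrelated table.
        String html = "<table><tr><td><table><tr>"
                + "<td><b>50 DANCE<br>18/01/2014</b></td>"
                + "</tr></table></td></tr></table>"
                + "<table><tr><td>SOMETHING ELSE</td></tr></table>";
        Document doc = Jsoup.parse(html);
        Element table = innermostTableContaining(doc, "50 DANCE");
        if (table != null) {
            System.out.println(table.select("td").first().text());
        }
    }
}
```

If the same phrase can appear in several unrelated tables, the last match is no longer guaranteed to be the one you want, and a more specific phrase or an extra check on the result would be needed.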
