
Extracting unstructured data from an HTML table using jsoup

I have been trying to get this code to work with jsoup. The idea is to extract the movie schedule from this page:

http://www.blitzmegaplex.com/en/schedule_movie.php?id=MOV1970

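So far I can only extract the cinema names on their own, because those cells are tagged with a distinct class name ("separator2"). All the other cells use the class "separator".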

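I am trying to build the following steps with a for loop. For each ROW in the TABLE:

  1. Get the cinema name (the title row).
  2. Skip the one row below it (counting from the step 1 row).
  3. Get the second <td> with the class named "separator".
  4. Get the second <td> from every row below it (counting from the step 3 row), until reaching the next row that contains the class named "separator2".
  5. Repeat the process until all rows have been processed.

Can anyone suggest how I should go about this? Or is there a better approach?

Thanks.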

My code so far:

public void getMovieSchedule(String movieUrl) throws IOException
{


    //URL url = new URL(movieUrl);
    //Document doc = Jsoup.parse(url, 3000);

    //Element table = doc.select("table[div=scheduletbl]").first();
    //Iterator<Element> ite = table.select("tr").iterator();
    //ite.next(); // Skip the first row.

    // Actual content
    //print(ite.next().text());

    // *** CODE ABOVE DOES NOT WORK ***

    //final String urlSchedule = "http://www.blitzmegaplex.com/en/schedule_movie.php?id=MOV1970";

    Document doc = Jsoup.connect(movieUrl).get();
    Elements div = doc.select("div.panelbox");

    for(Element child : div)
    {
        Elements table = child.select("table");
        Elements row = table.select("tr"); // The actual content.

        for (Element a: row)
        {
            Elements cinemaName = a.select("td.separator2");
            print(cinemaName.text()); // text() already returns a String
        }
    }
}

The HTML to extract from (some markup omitted):

<table width="95%" border="0" cellpadding="2" cellspacing="0" id="scheduletbl">
    <tbody>

    <tr>
    <td colspan="3" class="separator2"><strong>BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG</strong></td>
    </tr>

    <tr>
    <td colspan="3"><img src="../img/ico_rss_schedule_white.gif" width="16" height="16" hspace="5" align="left"><strong><a href="../rss/schedule.php" class="navlink">RSS- Paris van Java</a></strong></td>
    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td colspan="2" class="separator">TUESDAY, 05 NOVEMBER 2013</td>
    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    10:30&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=10:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    13:15&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=13:15&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    16:00&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=16:00&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    18:45&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=18:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    21:30&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0100&amp;movie=MOV1970&amp;showtime=21:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td colspan="3" class="separator2"><strong>BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA</strong></td>
    </tr>

    <tr>
    <td colspan="3"><img src="../img/ico_rss_schedule_white.gif" width="16" height="16" hspace="5" align="left"><strong><a href="../rss/schedule.php" class="navlink">RSS- Grand Indonesia</a></strong></td>
    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td colspan="2" class="separator">TUESDAY, 05 NOVEMBER 2013</td>
    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    10:45&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=10:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    13:30&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=13:30&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    16:15&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=16:15&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    19:00&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=19:00&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>

    </tr>
    <tr>
    <td class="separator">&nbsp;</td>
    <td width="20%" class="separator" rel="2D">
    21:45&nbsp;&nbsp;&nbsp;
    </td>
    <td width="30%" class="separator">
    <a href="https://www.blitzmegaplex.com/olb/seats.php?showdate=2013-11-05&amp;cinema=0200&amp;movie=MOV1970&amp;showtime=21:45&amp;suite=N&amp;movieformat=2D" class="navlink" target="_blank">Buy Tickets</a></td>
    </tr>
    ... MORE <tr> here ...
    </tbody></table>

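If I understand your question correctly, you only want to extract a few details from the table (the cinema name, the date, and the show times), but you are running into trouble because most of the cells share the same class name.

Based on that, here is my solution: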

// doc is the Document fetched with Jsoup.connect(movieUrl).get() in the question.
// The table id is "scheduletbl"; the selector must match it exactly.
Elements e = doc.select("table#scheduletbl > tbody > tr > td");
for (Element el : e) {
    if (el.hasClass("separator2")) System.out.println(el.text()); // cinema name
    else if ("2".equals(el.attr("colspan"))) System.out.println(el.text()); // date
    else if (el.hasAttr("rel")) System.out.println(el.text()); // times
}

This prints:

BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG
TUESDAY, 05 NOVEMBER 2013
10:30   
13:15   
16:00   
18:45   
21:30   
BLITZMEGAPLEX - GRAND INDONESIA, JAKARTA
TUESDAY, 05 NOVEMBER 2013
10:45   
13:30   
16:15   
19:00   
21:45 

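Of course, this solution is tightly coupled to that particular table on that site, but it will work as long as the format does not change often and stays consistent across the site. You may want to consider creating a class to store all of this information.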
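One way such a storage class might look is sketched below. The names `ScheduleCollector`, `CinemaSchedule`, `cinema`, `date`, `time`, and `clean` are hypothetical and not part of jsoup; the sketch is plain Java and assumes the three branches of the loop pass `el.text()` to the matching method instead of printing it. It also assumes each time cell follows a date cell, as in the HTML excerpt, and it strips the trailing whitespace visible after each time in the printed output, which may be non-breaking spaces depending on the jsoup version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: collects cinema, date and time cells into objects. */
public class ScheduleCollector {

    /** One cinema and its show times, grouped by date in page order. */
    public static final class CinemaSchedule {
        public final String cinema;
        public final Map<String, List<String>> timesByDate = new LinkedHashMap<>();

        CinemaSchedule(String cinema) {
            this.cinema = cinema;
        }
    }

    private final List<CinemaSchedule> schedules = new ArrayList<>();
    private List<String> currentTimes; // times list of the most recent date cell

    /** Turns non-breaking spaces (U+00A0) into plain spaces, then trims. */
    static String clean(String raw) {
        return raw.replace('\u00A0', ' ').trim();
    }

    /** Call with the text of a td.separator2 cell; starts a new cinema. */
    public void cinema(String text) {
        schedules.add(new CinemaSchedule(clean(text)));
        currentTimes = null; // a date cell must follow before any time cell
    }

    /** Call with the text of a date cell (the one with colspan="2"). */
    public void date(String text) {
        if (schedules.isEmpty()) {
            throw new IllegalStateException("date cell before any cinema cell");
        }
        CinemaSchedule last = schedules.get(schedules.size() - 1);
        currentTimes = last.timesByDate.computeIfAbsent(clean(text), k -> new ArrayList<>());
    }

    /** Call with the text of a time cell (the one carrying a rel attribute). */
    public void time(String text) {
        if (currentTimes == null) {
            throw new IllegalStateException("time cell before any date cell");
        }
        currentTimes.add(clean(text));
    }

    /** Returns the cinemas collected so far, in page order. */
    public List<CinemaSchedule> result() {
        return schedules;
    }

    public static void main(String[] args) {
        ScheduleCollector c = new ScheduleCollector();
        // Sample values copied from the output above; real calls would sit in the loop.
        c.cinema("BLITZMEGAPLEX - PARIS VAN JAVA, BANDUNG");
        c.date("TUESDAY, 05 NOVEMBER 2013");
        c.time("10:30\u00A0\u00A0\u00A0");
        c.time("13:15\u00A0\u00A0\u00A0");
        for (CinemaSchedule s : c.result()) {
            System.out.println(s.cinema + " -> " + s.timesByDate);
        }
    }
}
```

Grouping the times by date inside each cinema keeps the sketch usable if a cinema lists more than one day, which the excerpt does not show but the full page might.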
