简体   繁体   中英

Parse HTML table with Oracle

I need to parse an HTML table:

<table>
  <tbody>
    <tr class="blue"><td>code</td><td>rate</td></tr>
    <tr class="gray_1"><td><span>USD</span><em>1</em></td><td>476.16</td></tr>
    <tr class="gray_2"><td><span>AUD</span><em>1</em></td><td>327.65</td></tr>
    <tr class="gray_9"><td><span>IRR</span><em>100</em></td><td>1.13</td></tr>
    <tr class="blue"><td>some comment</td><td>some comment</td></tr>
    <tr class="gray_1"><td><span>EUR</span><em>1</em></td><td>526.54</td></tr>
  </tbody>
</table>

and as a result to get:

+------+----+----+------+
|class |code|em  |value |
+------+----+----+------+
|gray_1|USD |   1| 11.11|
|gray_2|AUD |   1| 22.22|
|gray_9|IRR | 100| 33.33|
|gray_1|EUR |   1| 44.44|

I tried to do it like this:

with tbl as
(
    select xmltype('
        <table>
          <tbody>
            <tr class="blue"><td>code</td><td>rate</td></tr>
            <tr class="gray_1"><td><span>USD</span><em>1</em></td><td>476.16</td></tr>
            <tr class="gray_2"><td><span>AUD</span><em>1</em></td><td>327.65</td></tr>
            <tr class="gray_9"><td><span>IRR</span><em>100</em></td><td>1.13</td></tr>
            <tr class="blue"><td>some comment</td><td>some comment</td></tr>
            <tr class="gray_1"><td><span>EUR</span><em>1</em></td><td>526.54</td></tr>
          </tbody>
        </table>
    ') xml_data from dual
)
select
    *
from
    tbl,
    xmltable('//table/tbody/tr'
        passing tbl.xml_data
        columns
            data varchar2(128) path './td'
    )

But this causes an exception:

ORA-19279: XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence - got multi-item sequence

What do I need to do to fix my code?

Your path is looking for a td under the tr ; but there are two, hence the "got multi-item sequence" error you're seeing. You can reference each td tag by its position, as td[1] etc. It's very reliant on the table structure being as expected though.

With this specific example you can do:

with tbl as
(
    select xmltype('
        <table>
          <tbody>
            <tr class="blue"><td>code</td><td>rate</td></tr>
            <tr class="gray_1"><td><span>USD</span><em>1</em></td><td>476.16</td></tr>
            <tr class="gray_2"><td><span>AUD</span><em>1</em></td><td>327.65</td></tr>
            <tr class="gray_9"><td><span>IRR</span><em>100</em></td><td>1.13</td></tr>
            <tr class="blue"><td>some comment</td><td>some comment</td></tr>
            <tr class="gray_1"><td><span>EUR</span><em>1</em></td><td>526.54</td></tr>
          </tbody>
        </table>
    ') xml_data from dual
)
select
    x.class, x.currency, x.amount, to_number(x.rate) as rate
from
    tbl
cross join
    xmltable('/table/tbody/tr'
        passing tbl.xml_data
        columns
            class varchar2(10) path '@class',
            currency varchar2(3) path 'td[1]/span',
            amount number path 'td[1]/em',
            rate varchar2(50) path 'td[2]'
    ) x
where
    x.currency is not null

which gets:

CLASS      CUR     AMOUNT       RATE
---------- --- ---------- ----------
gray_1     USD          1     476.16
gray_2     AUD          1     327.65
gray_9     IRR        100       1.13
gray_1     EUR          1     526.54

It won't take much variation in the HTML to break it though. See this answer for some reasons it is fragile, and why it generally considered unwise to try to parse HTML as XML.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM