
Using readHTMLTable with multiple tbody

Suppose I have an HTML table with multiple <tbody> elements, which is perfectly legal HTML, and I attempt to read it with readHTMLTable as follows:

library(XML)
table.text <- '<table>
  <thead>
    <tr><th>Col1</th><th>Col2</th></tr>
  </thead>
  <tbody>
    <tr><td>1a</td><td>2a</td></tr>
  </tbody>
  <tbody>
    <tr><td>1b</td><td>2b</td></tr>
  </tbody>
</table>'
readHTMLTable(table.text)

The output I get only takes the first <tbody> element:

$`NULL`
  Col1 Col2
1   1a   2a

and ignores the rest. Is this expected behavior? (I can't find any mention in the documentation.) And what are the most flexible and robust ways to access the entire table?

I'm currently using

table.text <- gsub('</tbody>[[:space:]]*<tbody>', '', table.text)
readHTMLTable(table.text)

which prevents me from using readHTMLTable directly on a URL to get a table like this, and also doesn't feel very robust.

If you look at the source for readHTMLTable, for example with getMethod(readHTMLTable, "XMLInternalElementNode"), you will see that it contains the lines

    if (length(tbody)) 
        node = tbody[[1]]

so it is deliberately written to use only the content of the first tbody. Also, ?readHTMLTable describes the function as providing

somewhat robust methods for extracting data from HTML tables in an HTML document

It is designed to be a utility function. It's great when it works, but you may need to hack around it.
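
One way to hack around it is to skip readHTMLTable for the body and collect the rows from every tbody with XPath. The following is only a minimal sketch, not part of the XML package: read_all_tbodies is a hypothetical helper name, and it assumes that the first table is the one wanted, that the header cells are th elements inside thead, and that every body row has the same number of td cells.

```r
library(XML)

# Hypothetical helper (not provided by the XML package): build a data frame
# from the rows of ALL <tbody> elements of the first table in the document.
read_all_tbodies <- function(html, asText = TRUE) {
  # With asText = FALSE, `html` may instead be a file name or URL
  doc <- htmlParse(html, asText = asText)

  # Header cells; assumes they are <th> elements inside <thead>
  header <- unlist(xpathSApply(doc, "(//table)[1]/thead/tr/th", xmlValue))

  # Every <tr> under any <tbody>, in document order
  rows <- getNodeSet(doc, "(//table)[1]/tbody/tr")

  # One character vector of cell values per row;
  # assumes every row has the same number of <td> cells
  cells <- lapply(rows, function(tr) xpathSApply(tr, "./td", xmlValue))

  df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)

  # Apply the header only if it matches the number of columns
  if (length(header) == ncol(df)) names(df) <- header
  df
}

table.text <- '<table>
  <thead>
    <tr><th>Col1</th><th>Col2</th></tr>
  </thead>
  <tbody>
    <tr><td>1a</td><td>2a</td></tr>
  </tbody>
  <tbody>
    <tr><td>1b</td><td>2b</td></tr>
  </tbody>
</table>'

result <- read_all_tbodies(table.text)
print(result)
```

Unlike the gsub approach, this works on the parsed document rather than on the raw text, so it does not depend on how the whitespace between </tbody> and <tbody> happens to be written. All columns come back as character, so convert them (for example with type.convert) if numeric columns are needed.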
