Using readHTMLTable with multiple tbody

Question

Suppose I have an HTML table with multiple <tbody> elements, which we know is perfectly legal HTML, and attempt to read it with readHTMLTable as follows:

library(XML)
table.text <- '<table>
  <thead>
    <tr><th>Col1</th><th>Col2</th></tr>
  </thead>
  <tbody>
    <tr><td>1a</td><td>2a</td></tr>
  </tbody>
  <tbody>
    <tr><td>1b</td><td>2b</td></tr>
  </tbody>
</table>'
readHTMLTable(table.text)

The output I get only takes the first <tbody> element:

$`NULL`
  Col1 Col2
1   1a   2a

and ignores the rest. Is this expected behavior? (I can't find any mention in the documentation.) And what are the most flexible and robust ways to access the entire table?

I'm currently using

table.text <- gsub('</tbody>[[:space:]]*<tbody>', '', table.text)
readHTMLTable(table.text)

which prevents me from using readHTMLTable directly on a URL to get a table like this, and also doesn't feel very robust.

Answer 1

If you look at the source for readHTMLTable, for example via getMethod(readHTMLTable, "XMLInternalElementNode"), it contains the lines

    if (length(tbody)) 
        node = tbody[[1]]

so it is purposefully designed to select only the content of the first tbody. Also, ?readHTMLTable describes the function as providing

somewhat robust methods for extracting data from HTML tables in an HTML document

It is designed to be a utility function. It's great when it works, but you may need to hack around it.
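
One way to hack around it, offered as a minimal sketch rather than a definitive fix, is to skip readHTMLTable for this table and collect the rows from every <tbody> yourself with XPath. The names doc, rows, cells and result are illustrative only, and the sketch assumes a single table whose body rows all have the same number of <td> cells:

```r
library(XML)

table.text <- '<table>
  <thead>
    <tr><th>Col1</th><th>Col2</th></tr>
  </thead>
  <tbody>
    <tr><td>1a</td><td>2a</td></tr>
  </tbody>
  <tbody>
    <tr><td>1b</td><td>2b</td></tr>
  </tbody>
</table>'

# Parse the HTML text into an internal document
doc <- htmlParse(table.text, asText = TRUE)

# Column names come from the header cells
header <- xpathSApply(doc, "//table/thead//th", xmlValue)

# Select the rows of every <tbody>, not just the first one
rows <- getNodeSet(doc, "//table/tbody/tr")

# Extract the cell text of each row (relative XPath, so only that row's cells)
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))

# Stack the rows into a data frame; assumes equal cell counts per row
result <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(result) <- header
result
```

Because htmlParse can also be given a file name or URL instead of text, the same approach should avoid the gsub preprocessing step, although the XPath expressions would need adjusting for pages with more than one table or with rows that use colspan or rowspan.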
