Using readHTMLTable with multiple tbody

Suppose I have an HTML table with multiple <tbody> elements, which is perfectly legal HTML, and I attempt to read it with readHTMLTable as follows:

library(XML)
table.text <- '<table>
  <thead>
    <tr><th>Col1</th><th>Col2</th>
  </thead>
  <tbody>
    <tr><td>1a</td><td>2a</td></tr>
  </tbody>
  <tbody>
    <tr><td>1b</td><td>2b</td></tr>
  </tbody>
</table>'
readHTMLTable(table.text)

The output I get contains only the first <tbody> element:

$`NULL`
  Col1 Col2
1   1a   2a

and ignores the rest. Is this expected behavior? (I can't find any mention of it in the documentation.) And what is the most flexible and robust way to access the entire table?

I'm currently using

table.text <- gsub('</tbody>[[:space:]]*<tbody>', '', table.text)
readHTMLTable(table.text)

which prevents me from using readHTMLTable directly on a URL to get a table like this, since the text has to be fetched and patched first. It also doesn't feel very robust.

If you look at the source for readHTMLTable with getMethod(readHTMLTable, "XMLInternalElementNode"), it contains the lines

    if (length(tbody)) 
        node = tbody[[1]]

so it is purposefully designed to select only the content of the first tbody. Also, ?readHTMLTable describes the function as providing

somewhat robust methods for extracting data from HTML tables in an HTML document

It is designed to be a utility function. It's great when it works, but you may need to hack around it.
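One possible way to hack around it, as a minimal sketch rather than a definitive fix: parse the document yourself and collect the rows from every tbody with XPath, instead of relying on readHTMLTable. The sketch assumes a single table whose header cells sit in thead and whose rows all have the same number of td cells; the variable names (doc, tbl, header, rows, cells, result) are just illustrative.

```r
library(XML)

# Same example table as in the question.
table.text <- '<table>
  <thead>
    <tr><th>Col1</th><th>Col2</th>
  </thead>
  <tbody>
    <tr><td>1a</td><td>2a</td></tr>
  </tbody>
  <tbody>
    <tr><td>1b</td><td>2b</td></tr>
  </tbody>
</table>'

# Parse the HTML text and pick the first table node.
doc <- htmlParse(table.text, asText = TRUE)
tbl <- getNodeSet(doc, "//table")[[1]]

# Column names come from the header cells.
header <- xpathSApply(tbl, "./thead/tr/th", xmlValue)

# Select the rows of *every* tbody, not just the first one.
rows <- getNodeSet(tbl, "./tbody/tr")

# Extract the cell text of each row, then bind the rows together.
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))
result <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(result) <- header

print(result)
```

Here result should hold both rows (1a/2a and 1b/2b). Since htmlParse also accepts a file name or URL, the same approach ought to work without patching the text with gsub first, although tables with rowspan/colspan or missing cells would need extra handling.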
