Removing tags in readHTMLTable from the XML package

Question

I'm trying to scrape data from the table at the following URL:

http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033

The problem is the superscripts contained within the <sup> </sup> tags. When I use the following code (admittedly not very elegant):

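library(XML)

url.overview <- "http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033"
overview <- readHTMLTable(url.overview)  # parse every table on the page
overview <- overview[[2]]                # the data is in the second table
overview <- overview[-1,]                # drop the first row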

# Strip non-ASCII characters, dollar signs, and commas, then convert to numeric
f <- function(x){
  out <- iconv(x, "latin1", "ASCII", sub="")
  out <- gsub('[\\$,]', '', out)
  out <- as.numeric(out)
  return(out)
}

overview <- matrix(f(as.character(unlist(overview))), ncol = ncol(overview))
overview <- as.data.frame(overview)
names(overview) <- c('year', 'fires', 'civ.deaths', 'civ.injuries', 'ff.deaths',
                     'ff.injuries', 'damage.reported', 'damage.2010dollars')

I get exactly what I want, except that the values in the superscripts are appended to the end of the values in the table cells. For example (using the row and column names from the URL given above), Civilian Deaths in 2001 are stored as 61963 when they should be 6196, since the superscript 3 is interpreted as an extra digit. Any cells in the table that lack a superscript come out exactly correct.
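To make the failure concrete, here is a minimal sketch that runs the same cleaning steps as f on a few made-up strings. The cells vector is an assumption for illustration, not data taken from the page; its second entry imitates a cell whose superscript text has been joined onto the value.

```r
# Same cleaning steps as f above, repeated so the snippet runs on its own.
f <- function(x){
  out <- iconv(x, "latin1", "ASCII", sub="")
  out <- gsub('[\\$,]', '', out)
  as.numeric(out)
}

# Made-up cell strings: the second imitates "6,196" with the text of
# <sup>3</sup> appended to it.
cells <- c("6,196", "6,1963", "$1,234")
cleaned <- f(cells)
```

Once the comma is stripped, the appended digit cannot be told apart from the real ones, so the superscript has to be dropped before the text reaches f.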

After many hours struggling through the documentation, I was able to use the functions htmlParse and getNodeSet from the XML package to identify all of the <sup> nodes, but couldn't figure out what to do from there:

overview <- htmlParse(url.overview)
getNodeSet(overview, "//sup")

I take it I somehow need to remove these parts of the XML tree and then pass the result back to readHTMLTable for further processing, but I couldn't figure out how to do this.
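As a rough sketch of what that node set contains, the snippet below runs the same XPath query on a small inline fragment. The html string is a hypothetical stand-in for the real page, and the variable names are chosen only for illustration.

```r
library(XML)

# Hypothetical fragment standing in for the real page.
html <- '<table><tr><td>6,196<sup>3</sup></td><td>3,380</td></tr></table>'
doc <- htmlParse(html, asText = TRUE)

sup.nodes <- getNodeSet(doc, "//sup")
n.sup <- length(sup.nodes)                       # number of <sup> elements found
sup.text <- sapply(sup.nodes, xmlValue)          # text inside each <sup>
cell.text <- xpathSApply(doc, "//td", xmlValue)  # cell text, superscripts included
```

The text of a cell is the concatenation of all text beneath it, which is why the superscript ends up glued to the number.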

I'd be very grateful for your thoughts.

Answer 1

Try:

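require(XML)
url.overview <- "http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033"
overview <- htmlParse(url.overview, encoding = "UTF-8")
# Select the <sup> markers that sit inside <span class="small"> elements
temp <- getNodeSet(overview, "/*//span[@class=\"small\"]/sup")
# Remove them from the parsed tree in place
removeNodes(temp)
# Read the tables from the modified tree and keep the second one
app.data <- readHTMLTable(overview)[[2]]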

Here we just remove the nodes we don't want and feed the remainder back into readHTMLTable, taking the second table. I was having issues with encoding on this Windows box. You may want to keep the encoding argument in htmlParse, or it might work fine without it for you.
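Since the page may change or become unavailable, here is a self-contained sketch of the same approach on an inline HTML string. The markup, the column names, and the variable names are assumptions made up to imitate the structure that the XPath above targets; only htmlParse, getNodeSet, removeNodes, and readHTMLTable come from the XML package.

```r
library(XML)

# Hypothetical stand-in for the real page: one table whose cells carry
# <sup> markers inside <span class="small">.
html <- '<html><body><table>
<tr><th>year</th><th>civ.deaths</th></tr>
<tr><td>2001</td><td>6,196<span class="small"><sup>3</sup></span></td></tr>
<tr><td>2002</td><td>3,380</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Read the table before touching the tree: the superscript text is included.
before <- readHTMLTable(doc, header = TRUE)[[1]]

# Drop the <sup> nodes from the parsed tree, then read the table again.
sup.nodes <- getNodeSet(doc, "//span[@class='small']/sup")
removeNodes(sup.nodes)
after <- readHTMLTable(doc, header = TRUE)[[1]]

# Flatten both results to plain character vectors for comparison.
before.values <- unlist(lapply(before, as.character), use.names = FALSE)
after.values <- unlist(lapply(after, as.character), use.names = FALSE)
```

Comparing before.values with after.values shows the effect: the 2001 cell carries the extra digit only in the first read, because removeNodes modifies the parsed document in place and the second call to readHTMLTable sees the pruned tree.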

Answer 1: 4, accepted, 2012-08-22 00:37:11