Parsing XML to data.frame in R

There are lots of questions on this, but I can't find a solution that suits this data format. I'd be grateful for advice on how to parse this:

<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
</constituency>
</XML>

The desired fields are evidently c('hansard_id','id','fromdate','todate','name'). To read in and parse the file I've tried the following:

require(XML)
> indata = htmlParse('data.xml', isHTML=F)
> class(indata)
[1] "XMLInternalDocument" "XMLAbstractDocument"
> print(indata)
<?xml version="1.0"?>
<XML>
  <constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
  </constituency>
  <constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
  </constituency>
  <constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
  </constituency>
</XML>

> xmlToDataFrame(indata, stringsAsFactors=F)
  name
1     
2     
3     

It's reading in OK, but xmlToDataFrame can't handle the format. Is it because the data are attributes of the 'constituency' tag elements? Very grateful for any guidance.

You are correct that xmlToDataFrame only reads the content of the XML nodes, not their attributes. For a given node, the xmlAttrs function returns that node's attributes. The xpathApply function takes a parsed XML document (call it doc) and applies an XPath expression to it to get a set of nodes. A user-defined function is then applied to each of these nodes. The XPath "//*/constituency" will return all the constituency nodes in your document. We can then apply the xmlAttrs function to each:

res <- xpathApply(doc, "//*/constituency", xmlAttrs)
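As a minimal sketch of that call, the sample from the question can be parsed from an inline string so the example runs on its own. The names xml_text and doc are only illustrative, and parsing with xmlParse(asText = TRUE) is an assumption made here in place of reading 'data.xml' from disk.

```r
library(XML)

# The sample from the question, inlined so the sketch is self-contained
xml_text <- '<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
</constituency>
</XML>'

# Parse the string into an internal document
doc <- xmlParse(xml_text, asText = TRUE)

# Apply xmlAttrs to every constituency node
res <- xpathApply(doc, "//*/constituency", xmlAttrs)

length(res)  # one element per constituency node
res[[1]]     # named character vector holding that node's attributes
```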

This returns a list with one set of attributes per node. We would like to bind these together, for example:

rbind.data.frame(res[[1]], res[[2]], ...)

would bind the first, second, third, ... sets of attributes together into a data.frame. A short way of doing this is to use the do.call function on our list of attributes:

do.call(rbind.data.frame, res)

will apply the row bind to all the elements of our list.
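Putting the steps together, here is one possible end-to-end sketch on the sample data. It makes two choices that are not in the answer above and should be read as assumptions. First, it uses plain rbind followed by as.data.frame rather than rbind.data.frame, so that the column names are taken directly from the attribute names. Second, it fills the requested name field from the text attribute of each child name element using xmlGetAttr. It also assumes every constituency node carries the same attributes in the same order and has exactly one name child.

```r
library(XML)

# The sample from the question, inlined so the sketch is self-contained
xml_text <- '<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
</constituency>
</XML>'

doc <- xmlParse(xml_text, asText = TRUE)

# One named character vector of attributes per constituency node
res <- xpathApply(doc, "//*/constituency", xmlAttrs)

# Stack the vectors into a character matrix, then convert to a data.frame
# (assumes identical attribute names and order on every node)
df <- as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)

# The name is stored as the 'text' attribute of the child <name> element
# (assumes exactly one <name> child per constituency)
df$name <- xpathSApply(doc, "//*/constituency/name", xmlGetAttr, "text")

str(df)
```

All columns come back as character, so fields such as hansard_id or the dates would still need converting if numeric or Date types are wanted.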
