
Parsing XML to data.frame in R

There are lots of questions on this, but I can't find a solution that suits this data format. I'd be grateful for advice on how to parse this:

<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
</constituency>
</XML>

The desired fields are evidently c('hansard_id','id','fromdate','todate','name'). To read in and parse the file I've tried the following:

require(XML)
> indata = htmlParse('data.xml', isHTML=F)
> class(indata)
[1] "XMLInternalDocument" "XMLAbstractDocument"
> print(indata)
<?xml version="1.0"?>
<XML>
  <constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
  </constituency>
  <constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
  </constituency>
  <constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
  </constituency>
</XML>

> xmlToDataFrame(indata, stringsAsFactors=F)
  name
1     
2     
3     

It's reading in ok, but xmlToDataFrame can't handle the format. Is it because the data are attributes to the 'constituency' tag elements? Very grateful for any guidance.

You are correct: xmlToDataFrame only reads the contents of the XML nodes, not their attributes. For a given node, the xmlAttrs function returns that node's attributes. The xpathApply function takes a parsed XML document (called doc here; it is indata in your code), applies an XPath expression to it to get a set of nodes, and then passes each of those nodes to a user-defined function. The XPath "//*/constituency" returns all the constituency nodes in your document. We can then apply the xmlAttrs function to each:

res <- xpathApply(doc, "//*/constituency", xmlAttrs)

This returns a list with one set of attributes per constituency node. We would like to bind these together, for example:

rbind.data.frame(res[[1]], res[[2]], ...)

would bind the first, second, third, ... sets of attributes together into a data.frame. A short way of doing this is to use the do.call function on our list of attributes:

do.call(rbind.data.frame, res)

will apply the row bind to all the elements of our list.
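
The steps above do not yet cover the 'name' field, which is stored in the text attribute of the child name node. The following is a minimal end-to-end sketch rather than a definitive solution. It makes a few assumptions: the sample XML from the question is inlined as a string (xml_text is a hypothetical variable name), every constituency has exactly one name child so the two node sets line up row by row, and plain rbind is used in place of rbind.data.frame so that the attribute names carry over as column names before the matrix is converted to a data.frame.

```r
library(XML)

# Sample document copied from the question, inlined so the sketch is self-contained.
xml_text <- '<XML>
<constituency hansard_id="5" id="uk.org.publicwhip/cons/1" fromdate="1918" todate="9999-12-31">
    <name text="Aberavon"/>
</constituency>
<constituency hansard_id="6" id="uk.org.publicwhip/cons/2" fromdate="1997-05-01" todate="2005-05-04">
    <name text="Aberdeen Central"/>
</constituency>
<constituency hansard_id="7" id="uk.org.publicwhip/cons/3" fromdate="1885" todate="9999-12-31">
    <name text="Aberdeen North"/>
</constituency>
</XML>'

# Parse the string; for a file, xmlParse("data.xml") would be used instead.
doc <- xmlParse(xml_text, asText = TRUE)

# One named character vector of attributes per constituency node.
res <- xpathApply(doc, "//constituency", xmlAttrs)

# rbind() on named vectors gives a character matrix whose column names
# are the attribute names; convert that matrix to a data.frame.
constituencies <- as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)

# The constituency name is the "text" attribute of the child <name> node.
# Assumes exactly one <name> per <constituency>, so the order matches the rows.
constituencies$name <- xpathSApply(doc, "//constituency/name", xmlGetAttr, "text")

print(constituencies)
```

All columns come back as character vectors, so fromdate and todate would still need converting if real dates are required. Note that some values are a bare year (for example "1918") rather than a full date.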


 