
Parsing XML into an R data frame

I'm stuck trying to parse a big XML file into an R data.frame object. The XML has the following structure:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?eclipse version="3.0"?>
<ROOT>
  <row>
    <field name="dtcreated"></field>
    <field name="headline"></field>
    <subheadline/>
    <field name="body"></field>
  </row>
  <row>
    <field name="dtcreated"></field>
    <field name="headline"></field>
    <subheadline/>
    <field name="body"></field>
  </row>
</ROOT>

The plyr convenience functions didn't help, since the XML couldn't be validated. So I came up with the following code, using XPath queries:

adHocXml<-xmlTreeParse(adHocXmlPath,getDTD = FALSE)
adHocRoot<-xmlRoot(adHocXml)
creationDateColumn<-sapply(getNodeSet(adHocRoot,"//row//field[@name='dtcreated']"), xmlValue)
headlineColumn<-sapply(getNodeSet(adHocRoot,"//row//field[@name='headline']"), xmlValue)
bodyColumn<-sapply(getNodeSet(adHocRoot,"//row//field[@name='body']"), xmlValue)
adHocData<-data.frame(creationDate=creationDateColumn,headline=headlineColumn,body=bodyColumn)

The code does exactly what I expect it to do for a short file. With a large file containing several thousand row tags, however, I get the following error after about 10 minutes:

Error: 1: internal error: Huge input lookup
2: Extra content at the end of the document 

Can anyone help me?

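libxml2, the C library that the R package XML is built on, has hard-coded parser limits, including an upper limit on the size a single node can be. You can turn these limits off by enabling the parser option XML_PARSE_HUGE. In the R package XML you would do this as follows: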

library(XML)
# HUGE is the XML package's constant for libxml2's XML_PARSE_HUGE option
doc <- xmlParse(myXML, options = HUGE)
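To tie this option back to the code in the question, here is a minimal sketch rather than a definitive implementation. It assumes the document is parsed once with xmlParse, which returns an internal (C-level) document that XPath can query directly, and that every row contains each named field exactly once so the columns line up. The inline sample data and the helper getColumn are made up for illustration; with a real file you would pass its path and drop asText = TRUE.

```r
library(XML)

# Tiny stand-in for the real file (hypothetical values).
sampleXml <- '<ROOT>
  <row>
    <field name="dtcreated">2012-01-01</field>
    <field name="headline">First</field>
    <subheadline/>
    <field name="body">Body one</field>
  </row>
  <row>
    <field name="dtcreated">2012-01-02</field>
    <field name="headline">Second</field>
    <subheadline/>
    <field name="body">Body two</field>
  </row>
</ROOT>'

# Parse once, with libxml2's hard-coded size limits relaxed.
doc <- xmlParse(sampleXml, asText = TRUE, options = HUGE)

# Hypothetical helper: text of every <field name="..."> directly under a <row>.
getColumn <- function(doc, fieldName) {
  xpathSApply(doc, sprintf("//row/field[@name='%s']", fieldName), xmlValue)
}

adHocData <- data.frame(
  creationDate = getColumn(doc, "dtcreated"),
  headline     = getColumn(doc, "headline"),
  body         = getColumn(doc, "body"),
  stringsAsFactors = FALSE
)
```

If some rows can lack a field, the three vectors will differ in length and data.frame will stop with an error, so a per-row extraction would be needed in that case.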

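You may also want to look at xmlEventParse, which reads the document as a stream of events (SAX style) instead of building the whole tree in memory. Martin Morgan provides a good example of its use here.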
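As a rough illustration of the event-based approach (this is not Martin Morgan's example, which is not reproduced here), the sketch below collects one character vector per row using closure-based handlers. The names makeRowCollector, handlers, and rows, as well as the sample data, are hypothetical. It assumes the row/field layout from the question and makes no claim about whether streaming alone avoids the size limit; its benefit is that the full tree is never held in memory.

```r
library(XML)

# Hypothetical helper: builds SAX handlers that share state through a closure.
makeRowCollector <- function() {
  rows <- list()      # one named character vector per completed <row>
  current <- NULL     # fields of the <row> currently being read
  fieldName <- NULL   # name attribute of the <field> currently open

  startElement <- function(name, attrs, ...) {
    if (name == "row") {
      current <<- character()
    } else if (name == "field") {
      fieldName <<- attrs[["name"]]
      current[fieldName] <<- ""
    }
  }
  text <- function(content, ...) {
    # Text can arrive in several chunks, so append rather than overwrite.
    if (!is.null(fieldName)) {
      current[fieldName] <<- paste0(current[fieldName], content)
    }
  }
  endElement <- function(name, ...) {
    if (name == "field") {
      fieldName <<- NULL
    } else if (name == "row") {
      rows[[length(rows) + 1]] <<- current
      current <<- NULL
    }
  }
  list(
    handlers = list(startElement = startElement, text = text,
                    endElement = endElement),
    rows = function() rows
  )
}

# Tiny stand-in for the real file (hypothetical values).
sampleXml <- '<ROOT>
  <row>
    <field name="dtcreated">2012-01-01</field>
    <field name="headline">First</field>
    <subheadline/>
    <field name="body">Body one</field>
  </row>
  <row>
    <field name="dtcreated">2012-01-02</field>
    <field name="headline">Second</field>
    <subheadline/>
    <field name="body">Body two</field>
  </row>
</ROOT>'

collector <- makeRowCollector()
invisible(xmlEventParse(sampleXml, handlers = collector$handlers,
                        asText = TRUE, trim = FALSE))

# Assemble the collected rows into a data frame, one column per field.
fields <- c("dtcreated", "headline", "body")
mat <- t(vapply(collector$rows(), function(r) unname(r[fields]), character(3)))
adHocData <- data.frame(creationDate = mat[, 1], headline = mat[, 2],
                        body = mat[, 3], stringsAsFactors = FALSE)
```

For a real file, pass its path as the first argument and drop asText = TRUE. Appending to a list row by row is kept for readability; for very large inputs, preallocating or writing rows out in batches may be preferable.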
