Convert XML to Data Frame in R

Hi, I am trying to convert the following XML to a data frame in R, but I can't because some records are missing values.

RecordID 23063 has ActivityCreatedDate, ExpectedInstallDate, and InvoiceTxnDate. However, some of the following nodes do not have all of these elements. For example, RecordID 23321 is missing InvoiceTxnDate.

<?xml version="1.0" encoding="windows-1252" ?>
  <Record>
    <RecordID>23063</RecordID>
    <ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate>
    <ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate>
    <InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate>
  </Record>
  <Record>
    <RecordID>23321</RecordID>
    <ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate>
    <ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate>
  </Record>
  <Record>
    <RecordID>23566</RecordID>
    <ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate>
  </Record>
  <Record>
    <RecordID>23217</RecordID>
    <ActivityCreatedDate>2018-12-20T19:00:00</ActivityCreatedDate>
    <ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate>
    <InvoiceTxnDate>2019-01-18T19:00:00</InvoiceTxnDate>
  </Record>
  <Record>
    <RecordID>23325</RecordID>
    <ActivityCreatedDate>2018-05-25T18:00:00</ActivityCreatedDate>
    <ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate>
  </Record>
</end of file>

Currently I am using xml2. I use read_xml to read the file into a variable, then xml_find_all and trimws to store each column in a list. I then try to turn the list into a data frame, but it fails because the columns have different lengths.

I want to know how I can turn the above XML into a data frame that looks like this:

RecordID    ActivityCreatedDate ExpectedInstallDate InvoiceTxnDate
1   23063   2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2   23321   2018-10-15T18:00:00 2018-11-14T19:00:00 NA
3   23566   2019-01-23T19:00:00 NA                  NA
4   23217   2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5   23325   2018-05-25T18:00:00 2019-01-23T19:00:00 NA

Is there a way to loop through each RecordID in this case and add a

<InvoiceTxnDate>NA</InvoiceTxnDate> or an <ExpectedInstallDate>NA</ExpectedInstallDate>

to the node if it's missing? I'd be more than happy to share the R code I have for data that is all uniform. Also, if this question does not make sense, please let me know and I will explain further.
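The padding idea asked about here could be sketched roughly as follows with xml2. This is only an illustration under stated assumptions: the sample records are pasted inline and wrapped in a root element named `Records` (a hypothetical name, since the sample has no root), and the `fields` vector is written out by hand.

```r
library(xml2)

# Assumption: the sample records, wrapped in a hypothetical <Records> root
# so that read_xml() accepts them as a single document.
doc <- read_xml('<Records>
  <Record><RecordID>23063</RecordID><ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate><ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate><InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate></Record>
  <Record><RecordID>23321</RecordID><ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate><ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate></Record>
  <Record><RecordID>23566</RecordID><ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate></Record>
  <Record><RecordID>23217</RecordID><ActivityCreatedDate>2018-12-20T19:00:00</ActivityCreatedDate><ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate><InvoiceTxnDate>2019-01-18T19:00:00</InvoiceTxnDate></Record>
  <Record><RecordID>23325</RecordID><ActivityCreatedDate>2018-05-25T18:00:00</ActivityCreatedDate><ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate></Record>
</Records>')

fields  <- c("RecordID", "ActivityCreatedDate", "ExpectedInstallDate", "InvoiceTxnDate")
records <- xml_find_all(doc, "./Record")

# Loop over each record and append an <X>NA</X> child for every missing field.
for (i in seq_along(records)) {
  rec     <- records[[i]]
  present <- xml_name(xml_children(rec))
  for (f in setdiff(fields, present)) {
    xml_add_child(rec, f, "NA")
  }
}

# Every column now has one node per record, so the lengths line up.
cols <- lapply(fields, function(f) xml_text(xml_find_all(doc, paste0("./Record/", f))))
df   <- as.data.frame(setNames(cols, fields), stringsAsFactors = FALSE)

# The padded text is the string "NA"; turn it into a real missing value.
df[df == "NA"] <- NA
df
```

Note that this modifies the parsed document in place, and the padded children are appended at the end of each record, so the element order inside a record may differ from the order in `fields`.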

Have you tried using the XML package?

XML::xmlToDataFrame('path to xml file')


> XML::xmlToDataFrame('~/R/test.xml')
  RecordID ActivityCreatedDate ExpectedInstallDate      InvoiceTxnDate
1    23063 2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2    23321 2018-10-15T18:00:00 2018-11-14T19:00:00                <NA>
3    23566 2019-01-23T19:00:00                <NA>                <NA>
4    23217 2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5    23325 2018-05-25T18:00:00 2019-01-23T19:00:00                <NA>
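For readers who want to try this without a file on disk, the same call can be sketched against an inline string. The assumptions here are that the records sit under a single root element (named `Records` purely for illustration) and that the XML package is installed; `xmlParse(..., asText = TRUE)` parses the string, and `stringsAsFactors = FALSE` is passed so the columns come back as character.

```r
library(XML)

# Assumption: the sample records wrapped in a hypothetical <Records> root.
xml_string <- '<Records>
  <Record><RecordID>23063</RecordID><ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate><ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate><InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate></Record>
  <Record><RecordID>23321</RecordID><ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate><ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate></Record>
  <Record><RecordID>23566</RecordID><ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate></Record>
  <Record><RecordID>23217</RecordID><ActivityCreatedDate>2018-12-20T19:00:00</ActivityCreatedDate><ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate><InvoiceTxnDate>2019-01-18T19:00:00</InvoiceTxnDate></Record>
  <Record><RecordID>23325</RecordID><ActivityCreatedDate>2018-05-25T18:00:00</ActivityCreatedDate><ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate></Record>
</Records>'

# Parse the string rather than a file path, then convert record by record.
doc <- xmlParse(xml_string, asText = TRUE)
df  <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
df
```

Fields that a record lacks should come back as NA, as in the output shown above.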

If the XML is exactly as shown above, with no root node, you can do the following:

library(xml2)
library(rvest)
library(tidyverse)

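## METHOD 1
## add missing root node
## read_html() tolerates the missing root by wrapping the records in <html><body>
read_html('~/R/test.xml') %>% html_children() %>% 
  as_xml_document(root = 'doc') %>% xml_contents() %>% xml_contents() %>% 
  map_df(., function(x) {
    ## one named list per record; absent fields are filled with NA when rows are bound
    kids <- xml_children(x)
    ## as.is = TRUE keeps the values as character instead of converting to factor
    setNames(as.list(type.convert(xml_text(kids), as.is = TRUE)), xml_name(kids))
  })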

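## METHOD 2
## treating the xml as a list
read_html('~/R/test.xml') %>% 
  html_nodes('record') %>%   ## the HTML parser lowercases tag names
  as_list() %>% 
  ## flatten each record one level, then turn it into a one-row tibble
  lapply(., function(x) unlist(x, recursive = FALSE) %>% bind_cols()) %>% 
  bind_rows()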


## both of the above methods will return the following tibble
## (column names are lowercase because read_html() lowercases the tag names)
# A tibble: 5 x 4
  recordid activitycreateddate expectedinstalldate invoicetxndate     
  <chr>    <chr>               <chr>               <chr>              
1 23063    2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2 23321    2018-10-15T18:00:00 2018-11-14T19:00:00 NA                 
3 23566    2019-01-23T19:00:00 NA                  NA                 
4 23217    2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5 23325    2018-05-25T18:00:00 2019-01-23T19:00:00 NA  
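If keeping the original tag case matters, a variant that stays entirely within xml2 might look like the sketch below. It is not taken from the answers above. It assumes the file content is available as a character vector (here typed inline as `raw_lines`, standing in for something like `readLines()`), drops the XML declaration and the `</end of file>` marker, and wraps the rest in a hypothetical `Records` root. It then relies on `xml_find_first()` returning one result per record, with a missing node (whose text is NA) where a field is absent.

```r
library(xml2)

# Assumption: the rootless sample as it might come back from readLines().
raw_lines <- c(
  '<?xml version="1.0" encoding="windows-1252" ?>',
  '<Record><RecordID>23063</RecordID><ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate><ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate><InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate></Record>',
  '<Record><RecordID>23321</RecordID><ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate><ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate></Record>',
  '<Record><RecordID>23566</RecordID><ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate></Record>',
  '<Record><RecordID>23217</RecordID><ActivityCreatedDate>2018-12-20T19:00:00</ActivityCreatedDate><ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate><InvoiceTxnDate>2019-01-18T19:00:00</InvoiceTxnDate></Record>',
  '<Record><RecordID>23325</RecordID><ActivityCreatedDate>2018-05-25T18:00:00</ActivityCreatedDate><ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate></Record>',
  '</end of file>'
)

# Drop the declaration and the end-of-file marker, then add a root element.
body <- raw_lines[!grepl("^[[:space:]]*(<\\?xml|</end of file>)", raw_lines)]
doc  <- read_xml(paste0("<Records>", paste(body, collapse = ""), "</Records>"))

records <- xml_find_all(doc, "./Record")
fields  <- c("RecordID", "ActivityCreatedDate", "ExpectedInstallDate", "InvoiceTxnDate")

# xml_find_first() yields exactly one entry per record, so all columns
# have the same length; absent fields become NA via xml_text().
cols <- lapply(fields, function(f) xml_text(xml_find_first(records, paste0("./", f))))
df   <- as.data.frame(setNames(cols, fields), stringsAsFactors = FALSE)
df
```

The difference from collecting each column with `xml_find_all()` over the whole document is that the search runs once per record, which is what keeps the rows aligned.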
