Hi, I am trying to convert the following XML to a data frame in R, but I can't because some records are missing values.
RecordID 23063 has ActivityCreatedDate, ExpectedInstallDate, and InvoiceTxnDate. However, some of the other records do not have all of these elements. For example, RecordID 23321 is missing InvoiceTxnDate.
<?xml version="1.0" encoding="windows-1252" ?>
<Record>
<RecordID>23063</RecordID>
<ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate>
<InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate>
</Record>
<Record>
<RecordID>23321</RecordID>
<ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate>
</Record>
<Record>
<RecordID>23566</RecordID>
<ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate>
</Record>
<Record>
<RecordID>23217</RecordID>
<ActivityCreatedDate>2018-12-20T19:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate>
<InvoiceTxnDate>2019-01-18T19:00:00</InvoiceTxnDate>
</Record>
<Record>
<RecordID>23325</RecordID>
<ActivityCreatedDate>2018-05-25T18:00:00</ActivityCreatedDate>
<ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate>
</Record>
</end of file>
Currently I am using xml2. I use read_xml to read the file into a variable, then xml_find_all and trimws to store each column in a list. I then try to turn the list into a data frame, but it fails because the columns have different lengths.
I want to know how I can turn the above XML into a data frame that looks like this:
RecordID ActivityCreatedDate ExpectedInstallDate InvoiceTxnDate
1 23063 2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2 23321 2018-10-15T18:00:00 2018-11-14T19:00:00 NA
3 23566 2019-01-23T19:00:00 NA NA
4 23217 2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5 23325 2018-05-25T18:00:00 2019-01-23T19:00:00 NA
Is there a way to loop through each RecordID in this case and add a
<InvoiceTxnDate>NA</InvoiceTxnDate> or a <ExpectedInstallDate>NA</ExpectedInstallDate>
to the node if it's missing? I'd be more than happy to share the R code I have for data that's all uniform. Also, if this question does not make sense, please let me know and I will explain further.
Have you tried using the XML package?
XML::xmlToDataFrame('path to xml file')
> XML::xmlToDataFrame('~/R/test.xml')
RecordID ActivityCreatedDate ExpectedInstallDate InvoiceTxnDate
1 23063 2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2 23321 2018-10-15T18:00:00 2018-11-14T19:00:00 <NA>
3 23566 2019-01-23T19:00:00 <NA> <NA>
4 23217 2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5 23325 2018-05-25T18:00:00 2019-01-23T19:00:00 <NA>
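Note that xmlToDataFrame needs well-formed XML with a single root element. Here is a minimal, self-contained sketch of the same call on an in-memory string, assuming the records are wrapped in a hypothetical <Records> root (that name is my assumption, it is not in the original file) and using only the first three records from the question:

```r
library(XML)

# First three records from the question, wrapped in a hypothetical
# <Records> root element so the text is well-formed XML.
xml_string <- paste0(
  "<Records>",
  "<Record><RecordID>23063</RecordID>",
  "<ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate>",
  "<InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate></Record>",
  "<Record><RecordID>23321</RecordID>",
  "<ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate></Record>",
  "<Record><RecordID>23566</RecordID>",
  "<ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate></Record>",
  "</Records>"
)

# Parse the string (asText = TRUE means "this is XML content, not a file path").
doc <- xmlParse(xml_string, asText = TRUE)

# One row per child of the root; elements missing from a record become NA.
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(df)
```

With a real file, the same idea would be to make sure the file has a root element before passing its path to xmlToDataFrame.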
If the XML is exactly as shown above, with no root node, you can do the following:
library(xml2)
library(rvest)
library(tidyverse)

## METHOD 1
## add the missing root node, then build one row per record
read_html('~/R/test.xml') %>%
  html_children() %>%
  as_xml_document(root = 'doc') %>%
  xml_contents() %>%
  xml_contents() %>%
  map_df(function(x) {
    kids <- xml_children(x)
    # as.is = TRUE keeps strings as character instead of converting them to factors
    setNames(as.list(type.convert(xml_text(kids), as.is = TRUE)), xml_name(kids))
  })

## METHOD 2
## treat the xml as a list
read_html('~/R/test.xml') %>%
  html_nodes('record') %>%
  as_list() %>%
  lapply(function(x) unlist(x, recursive = FALSE) %>% bind_cols()) %>%
  bind_rows()
## both of the above methods will return the following tibble
## (column names are lowercase because read_html parses the file as HTML)
# A tibble: 5 x 4
recordid activitycreateddate expectedinstalldate invoicetxndate
<chr> <chr> <chr> <chr>
1 23063 2018-12-11T19:00:00 2018-12-19T19:00:00 2018-12-13T19:00:00
2 23321 2018-10-15T18:00:00 2018-11-14T19:00:00 NA
3 23566 2019-01-23T19:00:00 NA NA
4 23217 2018-12-20T19:00:00 2019-01-23T19:00:00 2019-01-18T19:00:00
5 23325 2018-05-25T18:00:00 2019-01-23T19:00:00 NA
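If you would rather stay with xml2 alone and keep the original element names, one option is xml_find_first, which returns one result per record and a missing node where the element is absent, so every column has the same length. This is a minimal sketch rather than a definitive solution. It assumes the records are wrapped in a hypothetical <Records> root (my assumption, since the file as posted has none), and the fields vector is simply the four element names from the question:

```r
library(xml2)

# Records from the question, wrapped in a hypothetical <Records> root
# element so that read_xml accepts the text as well-formed XML.
xml_string <- paste0(
  "<Records>",
  "<Record><RecordID>23063</RecordID>",
  "<ActivityCreatedDate>2018-12-11T19:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2018-12-19T19:00:00</ExpectedInstallDate>",
  "<InvoiceTxnDate>2018-12-13T19:00:00</InvoiceTxnDate></Record>",
  "<Record><RecordID>23321</RecordID>",
  "<ActivityCreatedDate>2018-10-15T18:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2018-11-14T19:00:00</ExpectedInstallDate></Record>",
  "<Record><RecordID>23566</RecordID>",
  "<ActivityCreatedDate>2019-01-23T19:00:00</ActivityCreatedDate></Record>",
  "<Record><RecordID>23217</RecordID>",
  "<ActivityCreatedDate>2018-12-20T19:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate>",
  "<InvoiceTxnDate>2019-01-18T19:00:00</InvoiceTxnDate></Record>",
  "<Record><RecordID>23325</RecordID>",
  "<ActivityCreatedDate>2018-05-25T18:00:00</ActivityCreatedDate>",
  "<ExpectedInstallDate>2019-01-23T19:00:00</ExpectedInstallDate></Record>",
  "</Records>"
)

doc <- read_xml(xml_string)
records <- xml_find_all(doc, "//Record")

# Element names to extract, taken from the question.
fields <- c("RecordID", "ActivityCreatedDate",
            "ExpectedInstallDate", "InvoiceTxnDate")

# For each field, look it up once per record. xml_find_first keeps the
# result aligned with `records`, and xml_text gives NA for missing nodes.
cols <- lapply(fields, function(f) {
  xml_text(xml_find_first(records, paste0("./", f)))
})
names(cols) <- fields

df <- as.data.frame(cols, stringsAsFactors = FALSE)
print(df)
```

The difference from calling xml_find_all on the whole document is that xml_find_all only returns the nodes that exist, which is why the columns end up with different lengths.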