I am trying to parse a large XML file into an R data frame. The XML structure is uneven: nodes do not always contain every element, and some nodes contain the same element more than once.
The XML is:
<root>
  <members>
    <member>
      <id>1</id>
      <educations>
        <education>
          <institution>Sydney University</institution>
          <program>Masters of Science</program>
          <start-date>2010</start-date>
          <end-date>2015</end-date>
          <description></description>
        </education>
        <education>
          <institution>UTS</institution>
          <program>Bachelor of Science</program>
          <start-date>2004</start-date>
          <end-date>2008</end-date>
        </education>
      </educations>
    </member>
    <member>
      <id>2</id>
    </member>
    <member>
      <id>3</id>
      <educations>
        <education>
          <is-current>true</is-current>
          <institution>Monash Univeristy</institution>
          <start-date>2010</start-date>
        </education>
      </educations>
    </member>
  </members>
</root>
The desired output table would repeat the member ID for each of that member's education blocks. So ID 1 would have 2 rows, one per education period, and ID 3 would have just 1.
Using xmlToList() creates excessive columns, and I can't find a way to duplicate the ID for each child node.
This is an admittedly clumsy solution, and there are possibly far more elegant tidyverse-esque solutions. However, it seems to do the job.
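library(XML)
library(plyr)

# Education fields to keep as columns, in output order.
names_use <- c("institution", "program", "start-date", "end-date", "description")

# `test` is assumed to hold the XML shown above (e.g. a parsed document or a file path).
list_xml <- xmlToList(test)

# Build one data frame per <member>; ldply() row-binds them.
df_use <- ldply(list_xml$members, function(x) {
  # Member without <educations>: a single row with only the id filled in.
  if (is.null(x$educations)) {
    df_edu <- data.frame(x$id, t(rep(NA, 5)))
    names(df_edu) <- c("id", names_use)
    return(df_edu)
  }
  # One row per <education>; fields missing from a block become NA.
  df_res <- ldply(x$educations, function(edu_tmp) {
    df_edu <- as.data.frame(t(unlist(edu_tmp)),
                            stringsAsFactors = FALSE)
    for (i_names in names_use) {
      if (!i_names %in% names(df_edu)) {
        df_edu[, i_names] <- NA
      }
    }
    return(df_edu)
  })
  df_res$id <- x$id
  return(df_res[, c("id", names_use)])
})
# Keep only the id and the education columns.
df_use <- df_use[, c("id", names_use)]
df_use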
  id       institution             program start-date end-date description
1  1 Sydney University  Masters of Science       2010     2015          NA
2  1               UTS Bachelor of Science       2004     2008          NA
3  2              <NA>                <NA>       <NA>     <NA>          NA
4  3 Monash Univeristy                <NA>       2010     <NA>          NA
An alternate approach:
library(xml2)
library(tidyverse)
I like tidy column names so we'll add in a helper function:
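# Normalise column names: lower case, underscores instead of
# punctuation or spaces, no leading/trailing underscores, no duplicates.
mgca <- function(tbl) {
  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  colnames(tbl) <- x
  tbl
}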
doc <- read_xml("so.xml")
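As a quick check of what the mgca() helper above does to column names, here is a minimal sketch. The `demo` data frame and its column names are made up for illustration; they only mimic the hyphenated element names in the XML. The helper is repeated so the snippet runs on its own.

```r
# Same helper as above, repeated so this snippet is self-contained.
mgca <- function(tbl) {
  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  colnames(tbl) <- x
  tbl
}

# Hypothetical input; check.names = FALSE keeps the untidy names intact.
demo <- data.frame(
  `start-date` = 2010,
  `End Date`   = 2015,
  `is-current` = "true",
  check.names  = FALSE
)

cleaned_names <- colnames(mgca(demo))
print(cleaned_names)
# [1] "start_date" "end_date"   "is_current"
```

Hyphens and spaces both collapse to a single underscore, which is why `start-date` in the XML shows up as `start_date` in the final tibble.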
The idea here is to first iterate over each <member>, then extract the <id> for it. Once inside a <member>, check whether it has any <education> nodes. If not, just return the <id> in a data frame. If it does, iterate across each <education> node, identify the children present, pull out only those, and make a data frame for each node that includes the <id>. Finally, smush it all together into one data frame, clean up the column names, and get better column types:
xml_find_all(doc, ".//member") %>%
  map_df(~{
    # Member id as text.
    id <- (xml_find_first(.x, ".//id") %>% xml_text()) %||% NA_character_
    edus <- xml_find_all(.x, ".//educations/education")
    if (length(edus) > 0) {
      # One row per <education>, using only the child elements actually present.
      map_df(edus, ~{
        kid <- .x
        nodes <- xml_children(kid) %>% xml_name()
        map(nodes, ~xml_find_first(kid, sprintf(".//%s", .x)) %>%
              xml_text()) %>%
          set_names(nodes) %>%
          append(list(id = id)) %>%
          flatten_df()
      })
    } else {
      # No education blocks: keep the member as a single row with just its id.
      # tibble() replaces the deprecated data_frame().
      tibble(id = id)
    }
  }) %>%
  mgca() %>%
  type_convert()
## # A tibble: 4 x 7
##   institution       program             start_date end_date description    id is_current
##   <chr>             <chr>                    <int>    <int> <chr>       <int> <chr>
## 1 Sydney University Masters of Science        2010     2015 <NA>            1 <NA>
## 2 UTS               Bachelor of Science       2004     2008 <NA>            1 <NA>
## 3 <NA>              <NA>                        NA       NA <NA>            2 <NA>
## 4 Monash Univeristy <NA>                      2010       NA <NA>            3 true
Since type_convert() can't read minds, you'll likely have to turn is_current into a logical vector on your own.
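If is_current does come back as character, as in the output above, one way to convert it is base R's as.logical(). This is only a sketch: the `res` data frame below is a hypothetical stand-in for the final result, reduced to the two relevant columns.

```r
# Hypothetical stand-in for the result above (id and is_current only).
res <- data.frame(
  id = c(1L, 1L, 2L, 3L),
  is_current = c(NA, NA, NA, "true"),
  stringsAsFactors = FALSE
)

# as.logical() accepts strings such as "true"/"TRUE" and "false"/"FALSE";
# missing values stay NA.
res$is_current <- as.logical(res$is_current)

str(res$is_current)
```

Rows without an <is-current> element remain NA rather than FALSE, so "not stated" stays distinguishable from "not current".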