
Parsing uneven XML into R data.frame

I am trying to parse a large XML file into an R data frame. The structure of the XML is uneven: a node does not always contain every element, and some elements are repeated more than once within a node.

The XML is:

<root>
<members>
<member>
  <id>1</id>
  <educations>
    <education>
      <institution>Sydney University</institution>
      <program>Masters of Science</program>
      <start-date>2010</start-date>
      <end-date>2015</end-date>
      <description></description>
    </education>
    <education>
      <institution>UTS</institution>
      <program>Bachelor of Science</program>
      <start-date>2004</start-date>
      <end-date>2008</end-date>
    </education>
  </educations>
</member>

<member>
  <id>2</id>
 </member>

<member>
  <id>3</id>
  <educations>
    <education>
      <is-current>true</is-current>
      <institution>Monash Univeristy</institution>
      <start-date>2010</start-date>
    </education>
  </educations>
</member>
</members>
</root>

The desired output table has one row per education block, with the member's ID repeated on each row. So ID 1 would have 2 rows, one for each education period, and ID 3 would have just 1.

Using xmlToList() creates excessive columns, and I can't find a way to duplicate the ID for each child node.

This is an admittedly clumsy solution, and there are possibly far more elegant tidyverse-esque ones. However, it seems to do the job.

library(XML)
library(plyr)

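names_use <- c("institution", "program", "start-date", "end-date", "description")
# `test` is assumed to hold the XML from the question, e.g. test <- xmlParse("so.xml")
list_xml <- xmlToList(test)
# the root converts to a list with a single `members` element, one entry per <member>
df_use <- ldply(list_xml$members, function(x){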
    if(is.null(x$educations)){
        df_edu <- data.frame(x$id,t(rep(NA,5)))
        names(df_edu) <- c("id",names_use)
        return(df_edu)
    }
    df_res <- ldply(x$educations, function(edu_tmp){
        df_edu <- as.data.frame(t(unlist(edu_tmp)),
            stringsAsFactors = FALSE)
        for(i_names in names_use){
            if(!i_names %in% names(df_edu)){
                df_edu[,i_names] <- NA
            }
        }
        return(df_edu)
    })
    df_res$id <- x$id
    return(df_res[,c("id",names_use)])
})
df_use <- df_use[,c("id",names_use)]

df_use
  id       institution             program start-date end-date description
1  1 Sydney University  Masters of Science       2010     2015          NA
2  1               UTS Bachelor of Science       2004     2008          NA
3  2              <NA>                <NA>       <NA>     <NA>          NA
4  3 Monash Univeristy                <NA>       2010     <NA>          NA

An alternate approach:

library(xml2)
library(tidyverse)

I like tidy column names, so we'll add a helper function:

mgca <- function(tbl) {

  x <- colnames(tbl)
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")

  colnames(tbl) <- x

  tbl

}
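As a quick illustration of what the helper's cleaning steps do, here is a small base-R sketch that applies the same sequence of calls to a plain character vector. The vector `raw_names` is a made-up example, not taken from the XML above.

```r
# Hypothetical column names, chosen to exercise each cleaning step
raw_names <- c("start-date", "end-date", "Is Current", "end date")

clean <- tolower(raw_names)                          # lower-case everything
clean <- gsub("[[:punct:][:space:]]+", "_", clean)   # punctuation/whitespace runs become "_"
clean <- gsub("_+", "_", clean)                      # collapse repeated underscores
clean <- gsub("(^_|_$)", "", clean)                  # drop leading/trailing underscores
clean <- make.unique(clean, sep = "_")               # de-duplicate names that now collide

print(clean)
```

Note that "end-date" and "end date" collapse to the same name, which is why the final make.unique() step is there.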

doc <- read_xml("so.xml")

The idea here is to first iterate over each <member> and extract its <id>.
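A minimal sketch of just that first step, using only xml2 and a tiny inline document (the inline XML and the names `tiny_doc`, `members`, and `ids` are illustrative assumptions, not part of the original answer):

```r
library(xml2)

# Tiny stand-in for the real file so the sketch runs on its own
tiny_doc <- read_xml(
  "<root><members><member><id>1</id></member><member><id>2</id></member></members></root>"
)

# Find every <member>, then pull the text of its <id> child
members <- xml_find_all(tiny_doc, ".//member")
ids <- vapply(
  members,
  function(m) xml_text(xml_find_first(m, "./id")),
  character(1)
)

print(ids)
```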

Once inside a <member>, check whether it has any <education> nodes. If not, just return the <id> in a data frame. If it does, iterate over each <education> node, identify which children are present, and pull out only those, making a data frame for each <education> that also includes the <id>. Finally, bind everything together into one data frame, then clean up the column names and convert the columns to better types:

xml_find_all(doc, ".//member") %>% 
  map_df(~{

    id <- (xml_find_first(.x, ".//id") %>% xml_text()) %||% NA_character_

    edus <- xml_find_all(.x, ".//educations/education")

    if (length(edus) > 0) {

      map_df(edus, ~{
        kid <- .x
        nodes <- xml_children(kid) %>% xml_name()
        map(nodes, ~xml_find_first(kid, sprintf(".//%s", .x)) %>% 
              xml_text()) %>% 
          set_names(nodes) %>% 
          append(list(id = id)) %>% 
          flatten_df() 
      })

    } else {
      tibble(id = id)  # data_frame() is deprecated in favour of tibble()
    }

  }) %>% 
  mgca() %>% 
  type_convert()
## # A tibble: 4 x 7
##         institution             program start_date end_date description    id is_current
##               <chr>               <chr>      <int>    <int>       <chr> <int>      <chr>
## 1 Sydney University  Masters of Science       2010     2015        <NA>     1       <NA>
## 2               UTS Bachelor of Science       2004     2008        <NA>     1       <NA>
## 3              <NA>                <NA>         NA       NA        <NA>     2       <NA>
## 4 Monash Univeristy                <NA>       2010       NA        <NA>     3       true

In the output above, type_convert() left is_current as a character column, so you'll likely have to turn it into a logical vector yourself.
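One way to do that conversion is base R's as.logical(), which accepts the string "true". The sketch below uses a small hand-built data frame named `edu` as a stand-in for the result above; that name and its contents are assumptions for illustration only.

```r
# Stand-in for the parsed result: is_current is character, mostly missing
edu <- data.frame(
  id = c(1L, 1L, 2L, 3L),
  is_current = c(NA, NA, NA, "true"),
  stringsAsFactors = FALSE
)

# as.logical() maps "true"/"false" strings to TRUE/FALSE and keeps NA as NA
edu$is_current <- as.logical(edu$is_current)

str(edu)
```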
